The group_by node

The group_by node groups a stream of data by the values of one or more fields.
Each group is then processed independently of, and concurrently with, the other groups for the rest of the chain (subgraph).

Note: The behaviour of using more than one group_by node within a flow is undefined.

See group_union for how to 'un-group' a dataflow.

Be aware of high group cardinality: for every group, the dataflow engine starts a number of processes (how many depends on the size of the grouped sub-flow). A grouping with many distinct values therefore consumes correspondingly more resources.
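
As a rough way to reason about this cost, the following Python sketch (not DFS code, and not how the engine is implemented) counts the distinct groups in a sample of data and multiplies by the number of nodes in the grouped sub-flow. It assumes one process per node per group, which is a simplification made for illustration only; the helper name `estimate_processes` and the sample fields `device` and `mode` are hypothetical.

```python
def estimate_processes(items, fields, nodes_in_subflow):
    """Estimate how many processes a grouped sub-flow would need.

    Assumption (illustrative only): every group gets its own copy of the
    sub-flow, and every node in that copy runs as one process.
    """
    # A group is identified by the combination of values of the grouping fields.
    groups = {tuple(item[f] for f in fields) for item in items}
    return len(groups) * nodes_in_subflow


sample = [
    {"device": "a", "mode": 1},
    {"device": "b", "mode": 1},
    {"device": "a", "mode": 2},
    {"device": "a", "mode": 1},
]

# Grouping by a low-cardinality field versus a higher-cardinality combination.
low = estimate_processes(sample, ["mode"], nodes_in_subflow=2)
high = estimate_processes(sample, ["device", "mode"], nodes_in_subflow=2)
print(low, high)
```

Running such an estimate against a representative data sample before deploying a flow can show whether a chosen grouping field is likely to produce an unmanageable number of groups.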

Examples

  |group_by('fieldname1', 'fieldname2') 

Groups data along two dimensions: fieldname1 and fieldname2.
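
To illustrate what grouping along two dimensions means, here is a minimal Python sketch (a conceptual model, not the node's actual implementation): each distinct combination of the two field values forms its own group. The function `group_by` below and the sample values are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def group_by(items, *fields):
    """Partition items by the combination of values in the given fields."""
    groups = defaultdict(list)
    for item in items:
        key = tuple(item[f] for f in fields)  # composite group key
        groups[key].append(item)
    return dict(groups)


points = [
    {"fieldname1": "x", "fieldname2": 1, "val": 10},
    {"fieldname1": "x", "fieldname2": 2, "val": 11},
    {"fieldname1": "y", "fieldname2": 1, "val": 12},
    {"fieldname1": "x", "fieldname2": 1, "val": 13},
]

grouped = group_by(points, "fieldname1", "fieldname2")
for key, members in grouped.items():
    print(key, [p["val"] for p in members])
```

Two items land in the same group only if they agree on both fields, so adding a second grouping field can only keep the number of groups the same or increase it.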

def group_field = 'data.code'

|json_emitter()
.every(700ms)
.json(
    '{"code" : 224, "message": "this is a test", "mode": 1}',
    '{"code" : 334, "message": "this is another test", "mode": 1}',
    '{"code" : 114, "message": "this is another test", "mode": 2}',
    '{"code" : 443, "message": "this is another test", "mode": 1}',
    '{"code" : 224, "message": "this is another test", "mode": 2}',
    '{"code" : 111, "message": "this is another test", "mode": 1}',
    '{"code" : 551, "message": "this is another test", "mode": 2}'
)
.as('data')

|group_by(group_field)
|eval(
    lambda: str_replace("data.message", 'test', string("data.code"))
)
.as('data.message')
|debug()

In the above example, data is grouped by the code field. The emitter produces six distinct code values (224 appears twice), so the group_by node will eventually start 6 computing sub-graphs, one per group. Each computing sub-graph in this example consists of an eval node connected to a debug node.
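
The group count can be checked with a small Python sketch that mimics the example outside the dataflow engine. This is only a conceptual model under stated assumptions: it processes the seven JSON strings once each, in order, and it applies the string replacement with plain Python rather than the engine's eval node.

```python
import json

raw = [
    '{"code" : 224, "message": "this is a test", "mode": 1}',
    '{"code" : 334, "message": "this is another test", "mode": 1}',
    '{"code" : 114, "message": "this is another test", "mode": 2}',
    '{"code" : 443, "message": "this is another test", "mode": 1}',
    '{"code" : 224, "message": "this is another test", "mode": 2}',
    '{"code" : 111, "message": "this is another test", "mode": 1}',
    '{"code" : 551, "message": "this is another test", "mode": 2}',
]
items = [json.loads(s) for s in raw]

# Group by the code field: one list of items per distinct code value.
groups = {}
for item in items:
    groups.setdefault(item["code"], []).append(item)

# Per group, replace 'test' in the message with the code, as the eval step does.
for code, members in groups.items():
    for item in members:
        item["message"] = item["message"].replace("test", str(code))

print(len(groups))  # 6 distinct codes, so 6 groups
print(groups[224][0]["message"])
```

Code 224 occurs in two of the seven items, which is why seven inputs yield only six groups.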

Parameters

Parameter                     Description                                            Default
[node] fields( string_list )  fieldnames to group by
lambda( lambda )              use a function to group the data-items (experimental)  undefined