MergeRecord

Description:

This Processor merges together multiple record-oriented FlowFiles into a single FlowFile that contains all of the Records of the input FlowFiles. This Processor works by creating 'bins' and then adding FlowFiles to these bins until they are full. Once a bin is full, all of the FlowFiles will be combined into a single output FlowFile, and that FlowFile will be routed to the 'merged' Relationship. A bin will consist of potentially many 'like FlowFiles'. In order for two FlowFiles to be considered 'like FlowFiles', they must have the same Schema (as identified by the Record Reader) and, if the <Correlation Attribute Name> property is set, the same value for the specified attribute. See Processor Usage and Additional Details for more information. NOTE: this processor should NOT be configured with Cron Driven for the Scheduling Strategy.

Additional Details...
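
The notion of 'like FlowFiles' in the description above can be sketched in Python. This is an illustration only, not the processor's implementation: the dictionary layout, the `schema` field (a stand-in for the schema identified by the Record Reader), and the function names `bin_key` and `group_like_flowfiles` are assumptions made for this example.

```python
from collections import defaultdict


def bin_key(flowfile, correlation_attribute=None):
    """Return the key that decides which FlowFiles count as 'like FlowFiles'."""
    schema = flowfile["schema"]  # hypothetical stand-in for the Record Reader's schema
    if correlation_attribute is None:
        return (schema, None)
    # With a correlation attribute set, the attribute value must match as well.
    return (schema, flowfile["attributes"].get(correlation_attribute))


def group_like_flowfiles(flowfiles, correlation_attribute=None):
    """Group FlowFile names by bin key, preserving arrival order within a group."""
    groups = defaultdict(list)
    for flowfile in flowfiles:
        groups[bin_key(flowfile, correlation_attribute)].append(flowfile["name"])
    return dict(groups)


flowfiles = [
    {"name": "a", "schema": "order-v1", "attributes": {"region": "eu"}},
    {"name": "b", "schema": "order-v1", "attributes": {"region": "us"}},
    {"name": "c", "schema": "order-v2", "attributes": {"region": "eu"}},
    {"name": "d", "schema": "order-v1", "attributes": {"region": "eu"}},
]

by_schema = group_like_flowfiles(flowfiles)
by_schema_and_region = group_like_flowfiles(flowfiles, "region")
```

Without a correlation attribute, "a", "b" and "d" are candidates for the same bin because they share a schema. With `region` as the correlation attribute, "b" is separated from "a" and "d".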

Tags:

merge, record, content, correlation, stream, event

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

| Display Name | API Name | Default Value | Allowable Values | Description |
| --- | --- | --- | --- | --- |
| Record Reader | record-reader | | Controller Service API: RecordReaderFactory. Implementations: CSVReader, JsonPathReader, AvroReader, CEFReader, Syslog5424Reader, JsonTreeReader, WindowsEventLogReader, XMLReader, SyslogReader, JASN1Reader, ReaderLookup, ParquetReader, GrokReader, ScriptedReader, YamlTreeReader, ExcelReader | Specifies the Controller Service to use for reading incoming data |
| Record Writer | record-writer | | Controller Service API: RecordSetWriterFactory. Implementations: XMLRecordSetWriter, FreeFormTextRecordSetWriter, AvroRecordSetWriter, ScriptedRecordSetWriter, JsonRecordSetWriter, ParquetRecordSetWriter, RecordSetWriterLookup, CSVRecordSetWriter | Specifies the Controller Service to use for writing out the records |
| Merge Strategy | merge-strategy | Bin-Packing Algorithm | Bin-Packing Algorithm: Generates 'bins' of FlowFiles and fills each bin as full as possible. FlowFiles are placed into a bin based on their size and optionally their attributes (if the <Correlation Attribute Name> property is set). Defragment: Combines fragments that are associated by attributes back into a single cohesive FlowFile. If using this strategy, all FlowFiles must have the attributes <fragment.identifier> and <fragment.count>. All FlowFiles with the same value for "fragment.identifier" will be grouped together. All FlowFiles in this group must have the same value for the "fragment.count" attribute. The ordering of the Records that are output is not guaranteed. | Specifies the algorithm used to merge records. The 'Defragment' algorithm combines fragments that are associated by attributes back into a single cohesive FlowFile. The 'Bin-Packing Algorithm' generates a FlowFile populated by arbitrarily chosen FlowFiles |
| Correlation Attribute Name | correlation-attribute-name | | | If specified, two FlowFiles will be binned together only if they have the same value for this Attribute. If not specified, FlowFiles are bundled by the order in which they are pulled from the queue. |
| Attribute Strategy | Attribute Strategy | Keep Only Common Attributes | Keep Only Common Attributes: Any attribute that is not the same on all FlowFiles in a bin will be dropped. Those that are the same across all FlowFiles will be retained. Keep All Unique Attributes: Any attribute that has the same value for all FlowFiles in a bin, or has no value for a FlowFile, will be kept. For example, if a bin consists of 3 FlowFiles and 2 of them have a value of 'hello' for the 'greeting' attribute and the third FlowFile has no 'greeting' attribute, then the outbound FlowFile will get a 'greeting' attribute with the value 'hello'. | Determines which FlowFile attributes should be added to the bundle. If 'Keep All Unique Attributes' is selected, any attribute on any FlowFile that gets bundled will be kept unless its value conflicts with the value from another FlowFile. If 'Keep Only Common Attributes' is selected, only the attributes that exist on all FlowFiles in the bundle, with the same value, will be preserved. |
| Minimum Number of Records | min-records | 1 | | The minimum number of records to include in a bin. Supports Expression Language: true (will be evaluated using Environment variables only) |
| Maximum Number of Records | max-records | 1000 | | The maximum number of Records to include in a bin. This is a 'soft limit' in that if a FlowFile is added to a bin, all records in that FlowFile will be added, so this limit may be exceeded by up to the number of records in the last input FlowFile. Supports Expression Language: true (will be evaluated using Environment variables only) |
| Minimum Bin Size | min-bin-size | 0 B | | The minimum size for the bin |
| Maximum Bin Size | max-bin-size | | | The maximum size for the bundle. If not specified, there is no maximum. This is a 'soft limit' in that if a FlowFile is added to a bin, all records in that FlowFile will be added, so this limit may be exceeded by up to the number of bytes in the last input FlowFile. |
| Max Bin Age | max-bin-age | | | The maximum age of a Bin that will trigger a Bin to be complete. Expected format is <duration> <time unit> where <duration> is a positive integer and <time unit> is one of seconds, minutes, hours |
| Maximum Number of Bins | max.bin.count | 10 | | Specifies the maximum number of bins that can be held in memory at any one time. This number should not be smaller than the maximum number of concurrent threads for this Processor, or the bins that are created will often consist only of a single incoming FlowFile. |
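
The two Attribute Strategy options described above can be compared with a small Python sketch. It follows the wording of the property descriptions only; the function name `merge_attributes` and the use of plain dictionaries for FlowFile attributes are assumptions for illustration, not the processor's code.

```python
KEEP_COMMON = "Keep Only Common Attributes"
KEEP_ALL_UNIQUE = "Keep All Unique Attributes"


def merge_attributes(attribute_maps, strategy=KEEP_COMMON):
    """Pick the attributes for the merged FlowFile from the FlowFiles in one bin."""
    if strategy not in (KEEP_COMMON, KEEP_ALL_UNIQUE):
        raise ValueError(f"unknown strategy: {strategy}")
    names = set().union(*attribute_maps)
    merged = {}
    for name in sorted(names):
        # Distinct values seen for this attribute across the bin.
        values = {attrs[name] for attrs in attribute_maps if name in attrs}
        if len(values) > 1:
            continue  # conflicting values are dropped under both strategies
        on_every_flowfile = all(name in attrs for attrs in attribute_maps)
        if strategy == KEEP_ALL_UNIQUE or on_every_flowfile:
            merged[name] = values.pop()
    return merged


# The 'greeting' example from the property description, plus two extra attributes.
bin_attributes = [
    {"greeting": "hello", "type": "order", "source": "a"},
    {"greeting": "hello", "type": "order", "source": "b"},
    {"type": "order", "source": "c"},
]

common = merge_attributes(bin_attributes, KEEP_COMMON)
unique = merge_attributes(bin_attributes, KEEP_ALL_UNIQUE)
```

Here `common` keeps only `type`, because `greeting` is missing from one FlowFile. `unique` keeps both `type` and `greeting`. The `source` attribute conflicts across FlowFiles and is dropped in both cases.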

Relationships:

| Name | Description |
| --- | --- |
| failure | If the bundle cannot be created, all FlowFiles that would have been used to create the bundle will be transferred to failure |
| original | The FlowFiles that were used to create the bundle |
| merged | The FlowFile containing the merged records |

Reads Attributes:

| Name | Description |
| --- | --- |
| fragment.identifier | Applicable only if the <Merge Strategy> property is set to Defragment. All FlowFiles with the same value for this attribute will be bundled together. |
| fragment.count | Applicable only if the <Merge Strategy> property is set to Defragment. This attribute must be present on all FlowFiles with the same value for the fragment.identifier attribute. All FlowFiles in the same bundle must have the same value for this attribute. The value of this attribute indicates how many FlowFiles should be expected in the given bundle. |
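
How the two Defragment attributes work together can be sketched as follows. This is a simplified illustration under stated assumptions: FlowFiles are modeled as dictionaries, the function name `complete_fragment_groups` is hypothetical, and a mismatched `fragment.count` simply raises an error here, which is a simplification rather than a description of how the processor reacts.

```python
from collections import defaultdict


def complete_fragment_groups(flowfiles):
    """Return the fragment groups whose expected number of FlowFiles has arrived."""
    groups = defaultdict(list)
    for flowfile in flowfiles:
        groups[flowfile["fragment.identifier"]].append(flowfile)

    complete = {}
    for identifier, members in groups.items():
        counts = {member["fragment.count"] for member in members}
        if len(counts) != 1:
            # Every FlowFile in a group must carry the same fragment.count.
            raise ValueError(f"inconsistent fragment.count for {identifier}")
        expected = int(counts.pop())  # attribute values are strings
        if len(members) == expected:
            complete[identifier] = [member["name"] for member in members]
    return complete


flowfiles = [
    {"name": "x-1", "fragment.identifier": "x", "fragment.count": "2"},
    {"name": "y-1", "fragment.identifier": "y", "fragment.count": "3"},
    {"name": "x-2", "fragment.identifier": "x", "fragment.count": "2"},
]

ready = complete_fragment_groups(flowfiles)
```

Group "x" is ready because both of its expected FlowFiles are present. Group "y" still waits for two more fragments.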

Writes Attributes:

| Name | Description |
| --- | --- |
| record.count | The merged FlowFile will have a 'record.count' attribute indicating the number of records that were written to the FlowFile. |
| mime.type | The MIME Type indicated by the Record Writer |
| merge.count | The number of FlowFiles that were merged into this bundle |
| merge.bin.age | The age of the bin, in milliseconds, when it was merged and output. Effectively this is the greatest amount of time that any FlowFile in this bundle remained waiting in this processor before it was output |
| merge.uuid | UUID of the merged FlowFile. This attribute is added to the original FlowFiles. |
| <Attributes from Record Writer> | Any Attribute that the configured Record Writer returns will be added to the FlowFile. |

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component requires an incoming relationship.

Example Use Cases:

Use Case:

Combine together many arbitrary Records in order to create a single, larger file

Configuration:

Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type.

Set "Merge Strategy" to Bin-Packing Algorithm.

Set the "Minimum Bin Size" to the desired file size of the merged output file. For example, a value of 1 MB means that data will not be merged until at least 1 MB of data is available (unless the Max Bin Age is reached first). If there is no desired minimum file size, leave the default value of 0 B.

Set the "Minimum Number of Records" property to the minimum number of Records that should be included in the merged output file. For example, setting the value to 10000 ensures that the output file will have at least 10,000 Records in it (unless the Max Bin Age is reached first).

Set the "Max Bin Age" to specify the maximum amount of time to hold data before merging. This can be thought of as a "timeout" at which time the Processor will merge whatever data it has, even if the "Minimum Bin Size" and "Minimum Number of Records" have not been reached. It is always recommended to set this value. A reasonable default might be 10 mins if there is no other latency requirement.

Connect the 'merged' Relationship to the next component in the flow. Auto-terminate the 'original' Relationship.
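
The interaction between the two minimums and the "timeout" in the steps above can be sketched as a single check. This is a simplified reading of those steps, not the processor's actual logic: the function name `ready_to_merge` and its parameters are hypothetical, and the maximum limits are deliberately left out.

```python
def ready_to_merge(size_bytes, record_count, age_seconds,
                   min_bin_size=0, min_records=1, max_bin_age=None):
    """Decide whether a bin would be merged under the settings named above."""
    # The age limit acts as a timeout that overrides both minimums.
    if max_bin_age is not None and age_seconds >= max_bin_age:
        return True
    # Otherwise both minimums have to be satisfied.
    return size_bytes >= min_bin_size and record_count >= min_records


# Settings from the example: 1 MB (taken as 1,000,000 bytes here), 10,000 Records, 10 minutes.
settings = {"min_bin_size": 1_000_000, "min_records": 10_000, "max_bin_age": 600}

too_small = ready_to_merge(500_000, 12_000, 30, **settings)
big_enough = ready_to_merge(2_000_000, 12_000, 30, **settings)
timed_out = ready_to_merge(500_000, 200, 600, **settings)
```

A bin with enough Records but too few bytes keeps waiting (`too_small` is False). Once both minimums are met it is ready (`big_enough` is True), and a bin that has reached the age limit is ready regardless of its contents (`timed_out` is True).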



Example Use Cases Involving Other Components:

Use Case:

Combine together many Records that have the same value for a particular field in the data, in order to create a single, larger file

Keywords:

merge, combine, aggregate, like records, similar data

Components involved:

Component Type: org.apache.nifi.processors.standard.PartitionRecord

Configuration:

Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type.

Add a single additional property. The name of the property should describe the field on which the data is being merged together. The property's value should be a RecordPath that specifies which output FlowFile the Record belongs to. For example, to merge together data that has the same value for the "productSku" field, add a property named productSku with a value of /productSku.

Connect the "success" Relationship to MergeRecord.

Auto-terminate the "original" Relationship.
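
The effect of the PartitionRecord step above can be sketched in Python. This is an illustration under simplifying assumptions: Records are plain dictionaries, the RecordPath `/productSku` is reduced to a top-level field lookup, and the function name `partition_records` is hypothetical. Each output group carries an attribute named after the added property, which is what MergeRecord later correlates on.

```python
def partition_records(records, property_name, field):
    """Split records into one group per distinct value of `field`."""
    partitions = {}
    for record in records:
        # Stand-in for evaluating the RecordPath /<field> against the record.
        partitions.setdefault(record[field], []).append(record)
    return [
        {"attributes": {property_name: str(value)}, "records": group}
        for value, group in partitions.items()
    ]


records = [
    {"productSku": "A-1", "qty": 2},
    {"productSku": "B-7", "qty": 1},
    {"productSku": "A-1", "qty": 5},
]

outputs = partition_records(records, "productSku", "productSku")
```

The three input Records yield two groups: one with attribute `productSku` set to "A-1" holding two Records, and one set to "B-7" holding one.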



Component Type: org.apache.nifi.processors.standard.MergeRecord

Configuration:

Configure the "Record Reader" to specify a Record Reader that is appropriate for the incoming data type.

Configure the "Record Writer" to specify a Record Writer that is appropriate for the desired output data type.

Set "Merge Strategy" to Bin-Packing Algorithm.

Set the "Minimum Bin Size" to the desired file size of the merged output file. For example, a value of 1 MB means that data will not be merged until at least 1 MB of data is available (unless the Max Bin Age is reached first). If there is no desired minimum file size, leave the default value of 0 B.

Set the "Minimum Number of Records" property to the minimum number of Records that should be included in the merged output file. For example, setting the value to 10000 ensures that the output file will have at least 10,000 Records in it (unless the Max Bin Age is reached first).

Set the "Maximum Number of Records" property to a value at least as large as the "Minimum Number of Records". If there is no need to limit the maximum number of records per file, this number can be set to a value that will never be reached, such as 1000000000.

Set the "Max Bin Age" to specify the maximum amount of time to hold data before merging. This can be thought of as a "timeout" at which time the Processor will merge whatever data it has, even if the "Minimum Bin Size" and "Minimum Number of Records" have not been reached. It is always recommended to set this value. A reasonable default might be 10 mins if there is no other latency requirement.

Set the value of the "Correlation Attribute Name" property to the name of the property that you added in the PartitionRecord Processor. For example, if merging data based on the "productSku" field, the property in PartitionRecord was named productSku, so the value of the "Correlation Attribute Name" property should be productSku.

Set the "Maximum Number of Bins" property to a value that is at least as large as the number of distinct values that will be present for the Correlation Attribute. For example, if you expect 1,000 different SKUs, set this value to at least 1001. It is not advisable, though, to set the value above 10,000.

Connect the 'merged' Relationship to the next component in the flow.

Auto-terminate the 'original' Relationship.
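
Why the "Maximum Number of Bins" should cover the number of distinct correlation values can be illustrated with a small simulation. It rests on an assumption that is not stated in the steps above: when a new bin is needed and all bins are in use, the oldest bin is merged early to make room. The function name `simulate_binning` and the data are hypothetical.

```python
from collections import OrderedDict


def simulate_binning(correlation_values, max_bins):
    """Count FlowFiles per bin, evicting the oldest bin when no slot is free."""
    open_bins = OrderedDict()  # insertion order is creation order, oldest first
    early_merges = []
    for value in correlation_values:
        if value not in open_bins:
            if len(open_bins) >= max_bins:
                # Assumption: the oldest bin is merged early to free a slot.
                early_merges.append(open_bins.popitem(last=False))
            open_bins[value] = 0
        open_bins[value] += 1
    return early_merges, dict(open_bins)


stream = ["A", "B", "C", "A", "B", "C"]

too_few_merges, too_few_open = simulate_binning(stream, max_bins=2)
enough_merges, enough_open = simulate_binning(stream, max_bins=3)
```

With three distinct values and only two bins, every early merge in this run contains a single FlowFile. With three bins, nothing is merged early and each bin accumulates two FlowFiles.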





System Resource Considerations:

None specified.

See Also:

MergeContent, SplitRecord, PartitionRecord