CalculateParquetOffsets

Description:

The processor generates N FlowFiles from the input and adds attributes with the offsets required to read the corresponding group of rows in the FlowFile's content. It can be used to increase the overall efficiency of processing extremely large Parquet files.
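
As a rough illustration of the arithmetic behind such a split, the sketch below plans the record.offset and record.count values for each output FlowFile. It assumes zero-based offsets and that the last chunk simply holds the remainder. The helper name plan_splits and its parameters are hypothetical and are not part of the processor's API.

```python
def plan_splits(total_records, records_per_split):
    """Plan one attribute set per output FlowFile (hypothetical helper).

    Assumption: offsets are zero-based and the final chunk may be smaller
    than records_per_split.
    """
    if records_per_split <= 0:
        raise ValueError("records_per_split must be positive")
    splits = []
    # Step through the file in strides of records_per_split.
    for offset in range(0, total_records, records_per_split):
        count = min(records_per_split, total_records - offset)
        splits.append({"record.offset": offset, "record.count": count})
    return splits


if __name__ == "__main__":
    for attrs in plan_splits(10, 4):
        print(attrs)
```

Under these assumptions, a file of 10 records with Records Per Split set to 4 yields three chunks covering 4, 4, and 2 records.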

Tags:

parquet, split, partition, break apart, efficient processing, load balance, cluster

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name | API Name | Default Value | Allowable Values | Description
Records Per Split | Records Per Split | | | Specifies how many records each output FlowFile should cover. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Zero Content Output | Zero Content Output | false | true, false | Whether or not to copy the content of the input FlowFile to the output FlowFiles.

Relationships:

Name | Description
success | FlowFiles with special attributes, each representing a chunk of the input file.

Reads Attributes:

Name | Description
record.offset | Gets the index of the first record in the input.
record.count | Gets the number of records in the input.
parquet.file.range.startOffset | Gets the start offset of the selected row group in the Parquet file.
parquet.file.range.endOffset | Gets the end offset of the selected row group in the Parquet file.

Writes Attributes:

Name | Description
record.offset | Sets the index of the first record of the Parquet file.
record.count | Sets the number of records in the Parquet file.
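
To show how a downstream consumer might interpret these two attributes, the following sketch selects the matching window from a sequence of records. It assumes record.offset is a zero-based index and that attribute values arrive as strings. The function select_records is hypothetical, and a plain Python iterable stands in for a real Parquet reader.

```python
from itertools import islice


def select_records(records, attributes):
    """Return the records covered by one FlowFile's attributes (hypothetical helper).

    Assumption: record.offset is zero-based and record.count is the number
    of records to read starting at that offset.
    """
    offset = int(attributes["record.offset"])
    count = int(attributes["record.count"])
    # islice skips the first `offset` items and stops after `count` more.
    return list(islice(records, offset, offset + count))


if __name__ == "__main__":
    rows = [f"row-{i}" for i in range(10)]
    print(select_records(rows, {"record.offset": "4", "record.count": "3"}))
```

A real reader would normally seek to the relevant rows rather than iterate from the start of the file. The sketch only illustrates how the two attributes delimit a chunk.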

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component requires an incoming relationship.

System Resource Considerations:

None specified.