SampleRecord

SampleRecord 2.4.0

Bundle: org.apache.nifi | nifi-standard-nar
Description: Samples the records of a FlowFile based on a specified sampling strategy (such as Reservoir Sampling). The resulting FlowFile may contain a fixed number of records (reservoir-based algorithms), some subset of the total number of records (probabilistic sampling), or a deterministic number of records (interval sampling).
Tags: interval, range, record, reservoir, sample
Input Requirement: REQUIRED
Supports Sensitive Dynamic Properties: false

Additional Details for SampleRecord 2.4.0

This processor takes in a record set and samples records from the set according to the specified sampling strategy. The available sampling strategies are:
- Interval Sampling
  
  Select every _N_th record, where N is the value of the Sampling Interval property. For example, if there are 100 records in the set and the Sampling Interval is set to 4, the output contains 25 records, namely every 4th record. This samples the record set uniformly, so it is best suited to record sets that are uniformly distributed. For example, if a record set representing user information is uniformly distributed, the output records will be uniformly distributed as well. The outgoing record count is deterministic and is exactly the total number of records divided by the Sampling Interval value.
- Probabilistic Sampling
  
  Select each record with probability P, an integer percentage specified by the Sampling Probability value. For example, an incoming record set of 100 records with a Sampling Probability value of 20 should have roughly 20 records in the output. Use this when you want output record sets of roughly (but not exactly) the same size and when you want each record to have the same “chance” of being selected for the output set. As another example, if you send the same flow file into the processor twice, Interval Sampling will always produce the same output, whereas Probabilistic Sampling may output different records (and a different total number of records).
- Reservoir Sampling
  
  Select K records from a record set of N total records, where K is the value of the Reservoir Size property and each record has an equal probability of being selected (exactly K / N). For example, an incoming record set of 100 records with a Reservoir Size value of 20 should have exactly 20 records in the output, randomly chosen from the input record set. Use this when you want to control the exact number of output records and give each input record the same probability of being selected. As another example, if you send the same flow file into the processor twice, Interval Sampling will always produce the same output (same records and same number of records), Probabilistic Sampling may output different records (and a different total number of records), and Reservoir Sampling may output different records but the same total number of records. Note that the reservoir is kept in memory, so a very large reservoir may cause memory issues.
The “Random Seed” property applies to strategies that use a pseudorandom number generator, such as Probabilistic Sampling and Reservoir Sampling. The property is optional, but if it is set, the algorithm selects the same records from a given flow file each time. This is useful for testing flows that use non-deterministic algorithms such as Probabilistic Sampling and Reservoir Sampling.
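The two randomized strategies and the effect of a fixed seed can be sketched as follows. This is a minimal illustration, not the processor's implementation: the function names are hypothetical, the reservoir uses the textbook "Algorithm R" (the source does not say which reservoir algorithm NiFi uses), and Python's `random` module stands in for whatever generator the processor uses, so seed values are not interchangeable with NiFi's.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def probabilistic_sample(records: Iterable[T], percent: int,
                         seed: Optional[int] = None) -> List[T]:
    """Keep each record independently with probability percent / 100."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return [rec for rec in records if rng.random() < percent / 100]


def reservoir_sample(records: Iterable[T], size: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep exactly `size` records (or all of them if fewer arrive).

    Assumption: classic Algorithm R, which holds the whole reservoir in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, rec in enumerate(records):
        if index < size:
            reservoir.append(rec)          # fill the reservoir first
        else:
            slot = rng.randint(0, index)   # inclusive on both ends
            if slot < size:
                reservoir[slot] = rec      # replace with probability size / (index + 1)
    return reservoir


records = list(range(100))

first = reservoir_sample(records, 20, seed=42)
second = reservoir_sample(records, 20, seed=42)
print(len(first), first == second)  # 20 True

# Same seed, same selection; the count itself is only roughly 20.
print(probabilistic_sample(records, 20, seed=42)
      == probabilistic_sample(records, 20, seed=42))  # True
```

Without a seed, repeated calls to either function may select different records, which matches the behavior described for the non-deterministic strategies.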

Properties

Record Reader

Display Name

Record Reader

Description

Specifies the Controller Service to use for parsing incoming data and determining the data's schema

API Name

record-reader

Service Interface

org.apache.nifi.serialization.RecordReaderFactory

Service Implementations

org.apache.nifi.avro.AvroReader

org.apache.nifi.cef.CEFReader

org.apache.nifi.csv.CSVReader

org.apache.nifi.excel.ExcelReader

org.apache.nifi.grok.GrokReader

org.apache.nifi.json.JsonPathReader

org.apache.nifi.json.JsonTreeReader

org.apache.nifi.services.protobuf.ProtobufReader

org.apache.nifi.lookup.ReaderLookup

org.apache.nifi.record.script.ScriptedReader

org.apache.nifi.syslog.Syslog5424Reader

org.apache.nifi.syslog.SyslogReader

org.apache.nifi.windowsevent.WindowsEventLogReader

org.apache.nifi.xml.XMLReader

org.apache.nifi.yaml.YamlTreeReader

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Record Writer

Display Name

Record Writer

Description

Specifies the Controller Service to use for writing results to a FlowFile

API Name

record-writer

Service Interface

org.apache.nifi.serialization.RecordSetWriterFactory

Service Implementations

org.apache.nifi.avro.AvroRecordSetWriter

org.apache.nifi.csv.CSVRecordSetWriter

org.apache.nifi.text.FreeFormTextRecordSetWriter

org.apache.nifi.json.JsonRecordSetWriter

org.apache.nifi.lookup.RecordSetWriterLookup

org.apache.nifi.record.script.ScriptedRecordSetWriter

org.apache.nifi.xml.XMLRecordSetWriter

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Sampling Interval
Display Name

Sampling Interval

Description

Specifies the sampling interval: every Nth record is written to the outgoing FlowFile, where N is this value. This property is only used if Sampling Strategy is set to Interval Sampling. A value of zero (0) will cause no records to be included in the outgoing FlowFile, a value of one (1) will cause all records to be included, a value of two (2) will cause half the records to be included, and so on.

API Name

sample-record-interval

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true

Dependencies
- Sampling Strategy is set to any of [interval]
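The effect of the values 0, 1, 2 and 4 can be checked with a small sketch. It assumes records are counted from 1 and that the last record of each group of N is the one kept ("every Nth record"); the function name is hypothetical and this is not the processor's code.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def interval_sample(records: Iterable[T], interval: int) -> List[T]:
    """Keep every `interval`-th record (1-based); an interval of 0 keeps nothing."""
    if interval < 0:
        raise ValueError("interval must be non-negative")
    if interval == 0:
        return []
    return [rec for pos, rec in enumerate(records, start=1) if pos % interval == 0]


records = list(range(1, 101))  # 100 records, numbered 1..100

for interval in (0, 1, 2, 4):
    print(interval, len(interval_sample(records, interval)))
# 0 0
# 1 100
# 2 50
# 4 25

print(interval_sample(records, 4)[:3])  # [4, 8, 12]
```

Because no random numbers are involved, the same input always yields the same output.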
Sampling Probability
Display Name

Sampling Probability

Description

Specifies the probability (as a percentage from 0 to 100) of a record being included in the outgoing FlowFile. This property is only used if Sampling Strategy is set to Probabilistic Sampling. A value of zero (0) will cause no records to be included in the outgoing FlowFile, and a value of 100 will cause all records to be included in the outgoing FlowFile.

API Name

sample-record-probability

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true

Dependencies
- Sampling Strategy is set to any of [probabilistic]
Random Seed
Display Name

Random Seed

Description

Specifies a particular number to use as the seed for the random number generator (used by randomized strategies such as Probabilistic Sampling and Reservoir Sampling). Setting this property ensures that the same records are selected each time, even when using these strategies.

API Name

sample-record-random-seed

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

false

Dependencies
- Sampling Strategy is set to any of [probabilistic, reservoir]
Sampling Range
Display Name

Sampling Range

Description

Specifies the range of records to include in the sample, from 1 to the total number of records. An example is '3,6-8,20-', which includes the third record, the sixth, seventh and eighth records, and all records from the twentieth record on. Commas separate intervals that don't overlap, and an interval can be between two numbers (e.g. 6-8), up to a given number (e.g. -5), or from a number to the number of the last record (e.g. 20-). If this property is unset, all records will be included.

API Name

sample-record-range

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true

Dependencies
- Sampling Strategy is set to any of [range]
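The range syntax described for Sampling Range can be illustrated with a short parser. This is a sketch under assumptions: the helper names are hypothetical, endpoints are treated as inclusive (as the '6-8' example implies), and an empty expression stands in for an unset property. It does not reproduce the processor's own validation.

```python
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
Interval = Tuple[int, Optional[int]]  # (first, last); last=None means "to the end"


def parse_ranges(expression: str) -> List[Interval]:
    """Parse an expression such as '3,6-8,20-' into 1-based inclusive intervals."""
    intervals: List[Interval] = []
    for part in expression.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            first = int(low) if low else 1        # '-5' means records 1..5
            last = int(high) if high else None    # '20-' means record 20 onward
        else:
            first = last = int(part)              # a single record number
        intervals.append((first, last))
    return intervals


def range_sample(records: Iterable[T], expression: str) -> List[T]:
    """Keep the records whose 1-based position falls inside any interval."""
    intervals = parse_ranges(expression)
    if not intervals:
        return list(records)  # treated like an unset property: keep everything
    return [
        rec
        for pos, rec in enumerate(records, start=1)
        if any(first <= pos and (last is None or pos <= last)
               for first, last in intervals)
    ]


records = list(range(1, 26))  # 25 records, numbered 1..25
print(range_sample(records, "3,6-8,20-"))
# [3, 6, 7, 8, 20, 21, 22, 23, 24, 25]
```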
Reservoir Size
Display Name

Reservoir Size

Description

Specifies the number of records to write to the outgoing FlowFile. This property is only used if Sampling Strategy is set to reservoir-based strategies such as Reservoir Sampling.

API Name

sample-record-reservoir

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true

Dependencies
- Sampling Strategy is set to any of [reservoir]
Sampling Strategy
Display Name

Sampling Strategy

Description

Specifies which method to use for sampling records from the incoming FlowFile

API Name

sample-record-sampling-strategy

Default Value

reservoir

Allowable Values
- Interval Sampling
- Range Sampling
- Probabilistic Sampling
- Reservoir Sampling
Expression Language Scope

Not Supported

Sensitive

false

Required

true
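The "Required" and "Dependencies" entries above imply that each strategy needs exactly one strategy-specific property. The sketch below encodes that relationship using the API names listed on this page. The checker itself, the controller service identifiers, and the assumption that the strategy is stored under the values shown in the Dependencies entries (interval, probabilistic, range, reservoir) are illustrative only.

```python
from typing import Dict, List

STRATEGY_KEY = "sample-record-sampling-strategy"
DEFAULT_STRATEGY = "reservoir"  # the documented default value

# Strategy-specific property, taken from each property's "Dependencies" entry.
REQUIRED_BY_STRATEGY = {
    "interval": "sample-record-interval",
    "probabilistic": "sample-record-probability",
    "range": "sample-record-range",
    "reservoir": "sample-record-reservoir",
}
ALWAYS_REQUIRED = ["record-reader", "record-writer"]


def missing_properties(properties: Dict[str, str]) -> List[str]:
    """Return the API names of required properties that have no value."""
    missing = [name for name in ALWAYS_REQUIRED if not properties.get(name)]
    strategy = properties.get(STRATEGY_KEY, DEFAULT_STRATEGY)
    dependent = REQUIRED_BY_STRATEGY.get(strategy)
    if dependent and not properties.get(dependent):
        missing.append(dependent)
    return missing


config = {
    "record-reader": "csv-reader-id",    # hypothetical controller service id
    "record-writer": "json-writer-id",   # hypothetical controller service id
    STRATEGY_KEY: "interval",
}
print(missing_properties(config))  # ['sample-record-interval']

config["sample-record-interval"] = "4"
print(missing_properties(config))  # []
```

The optional Random Seed property (sample-record-random-seed) is deliberately left out of the check because it is not required for any strategy.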

System Resource Considerations

Resource	Description
MEMORY	An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.

Relationships

Name	Description
failure	If a FlowFile fails processing for any reason (for example, any record is not valid), the original FlowFile will be routed to this relationship
original	The original FlowFile is routed to this relationship if sampling is successful
success	The FlowFile containing the sampled records is routed to this relationship if sampling completes successfully

Writes Attributes

Name	Description
mime.type	The MIME type indicated by the record writer
record.count	The number of records in the resulting flow file