PutHiveStreaming

Description:

This processor uses Hive Streaming to send flow file data to an Apache Hive table. The incoming flow file is expected to be in Avro format, and the table must already exist in Hive. Please see the Hive documentation for requirements on the Hive table (format, partitions, etc.). The partition values are extracted from each Avro record based on the names of the partition columns specified in the processor. NOTE: If multiple concurrent tasks are configured for this processor, a given table can be written to by only one thread at a time. Additional tasks intending to write to the same table will wait for the current task to finish writing to it.
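As a rough illustration of how partition values can be derived from a record by column name, here is a minimal Python sketch. It is not the processor's implementation: the function name `extract_partition_values`, the dictionary representation of an Avro record, and the sample field names are all assumptions made for the example.

```python
def extract_partition_values(record, partition_columns):
    """Return the partition values for one record, in partition-column order.

    record: a dict standing in for a decoded Avro record (assumption).
    partition_columns: comma-delimited column names, as in the
    'Partition Columns' property.
    """
    names = [name.strip() for name in partition_columns.split(",") if name.strip()]
    missing = [name for name in names if name not in record]
    if missing:
        raise KeyError("record has no field(s) for partition column(s): "
                       + ", ".join(missing))
    # Order follows the property value, which must match the table definition.
    return [str(record[name]) for name in names]


# Hypothetical record and partition columns, for illustration only.
sample = {"id": 1, "msg": "hello", "continent": "Asia", "country": "India"}
values = extract_partition_values(sample, "continent, country")
```

Because the values are returned in the order given by the property, a list that does not match the table's partition column order would place values in the wrong partition columns.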

Tags:

hive, streaming, put, database, store

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Name | Default Value | Allowable Values | Description
Hive Metastore URI | | | The URI location for the Hive Metastore. Note that this is not the location of the Hive Server. The default port for the Hive metastore is 9083. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Hive Configuration Resources | | | A file or comma-separated list of files containing the Hive configuration (e.g., hive-site.xml). Without this, Hadoop will search the classpath for a 'hive-site.xml' file or will revert to a default configuration. Note that to enable authentication (with Kerberos, for example), the appropriate properties must be set in the configuration files. Also note that if Max Concurrent Tasks is set to a number greater than one, the 'hcatalog.hive.client.cache.disabled' property will be forced to 'true' to avoid concurrency issues. Please see the Hive documentation for more details. Supports Expression Language: true (will be evaluated using variable registry only)
Database Name | | | The name of the database in which to put the data. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Table Name | | | The name of the database table in which to put the data. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Partition Columns | | | A comma-delimited list of column names on which the table has been partitioned. The order of values in this list must correspond exactly to the order of partition columns specified during table creation. Supports Expression Language: true (will be evaluated using variable registry only)
Auto-Create Partitions | true | true, false | Flag indicating whether partitions should be automatically created.
Max Open Connections | 8 | | The maximum number of open connections that can be allocated from this pool at the same time, or negative for no limit.
Heartbeat Interval | 60 | | Indicates that a heartbeat should be sent when the specified number of seconds has elapsed. A value of 0 indicates that no heartbeat should be sent. Note that although this property supports Expression Language, it will not be evaluated against incoming FlowFile attributes. Supports Expression Language: true (will be evaluated using variable registry only)
Transactions per Batch | 100 | | A hint to Hive Streaming indicating how many transactions the processor task will need. This value must be greater than 1. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Records per Transaction | 10000 | | Number of records to process before committing the transaction. This value must be greater than 1. Supports Expression Language: true (will be evaluated using flow file attributes and variable registry)
Call Timeout | 0 | | The number of seconds allowed for a Hive Streaming operation to complete. A value of 0 indicates the processor should wait indefinitely on operations. Note that although this property supports Expression Language, it will not be evaluated against incoming FlowFile attributes. Supports Expression Language: true (will be evaluated using variable registry only)
Rollback On Failure | false | true, false | Specifies how to handle errors. By default (false), if an error occurs while processing a FlowFile, the FlowFile is routed to the 'failure' or 'retry' relationship based on the error type, and the processor continues with the next FlowFile. Alternatively, you may want to roll back the FlowFiles currently being processed and stop further processing immediately; enable this property to do so. If enabled, a failed FlowFile stays in the incoming queue without being penalized and is processed repeatedly until it succeeds or is removed by other means. It is important to set an adequate 'Yield Duration' to avoid retrying too frequently. NOTE: If an error occurs after a Hive Streaming transaction derived from the same input FlowFile has already been committed (i.e., the FlowFile contains more records than 'Records per Transaction' and the failure occurs at the second transaction or later), the records that succeeded are transferred to the 'success' relationship while the original input FlowFile stays in the incoming queue. Duplicate records can be created for the succeeded ones when the same FlowFile is processed again.
Kerberos Credentials Service | | Controller Service API: KerberosCredentialsService; Implementation: KeytabCredentialsService | Specifies the Kerberos Credentials Controller Service that should be used for authenticating with Kerberos.
Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. Supports Expression Language: true (will be evaluated using variable registry only)
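
The property constraints above can be sketched as a small pre-flight check. This is an illustrative Python sketch, not part of NiFi: the host name, database and table names, and the helper names `check_properties` and `transactions_needed` are hypothetical, and the `thrift://` scheme check reflects the usual form of a Hive metastore URI rather than anything stated on this page. The transaction count assumes the simple case in which one transaction is committed for every full or partial group of 'Records per Transaction' records.

```python
import math
from urllib.parse import urlparse

# Hypothetical property values; the host and object names are made up.
properties = {
    "Hive Metastore URI": "thrift://metastore.example.com:9083",
    "Database Name": "default",
    "Table Name": "events",
    "Partition Columns": "continent,country",
    "Transactions per Batch": 100,
    "Records per Transaction": 10000,
}


def check_properties(props):
    """Return a list of problems found in the property values (empty if none)."""
    problems = []
    uri = urlparse(props["Hive Metastore URI"])
    # Assumption: metastore URIs take the form thrift://host:port.
    if uri.scheme != "thrift" or uri.port is None:
        problems.append("Hive Metastore URI should look like thrift://host:port")
    # Both values must be greater than 1, per the property descriptions.
    for name in ("Transactions per Batch", "Records per Transaction"):
        if int(props[name]) <= 1:
            problems.append(name + " must be greater than 1")
    return problems


def transactions_needed(record_count, records_per_txn):
    """Estimate how many commits a flow file with record_count records needs."""
    return math.ceil(record_count / records_per_txn)


problems = check_properties(properties)
txns = transactions_needed(25000, properties["Records per Transaction"])
```

With the default of 10000 records per transaction, a flow file of 25000 records would need three commits under this assumption, well within the default hint of 100 transactions per batch.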

Relationships:

Name | Description
retry | The incoming FlowFile is routed to this relationship if its records cannot be transmitted to Hive. Note that some records may have been processed successfully; these are routed (as Avro flow files) to the success relationship. The combination of the retry, success, and failure relationships indicates how many records succeeded and/or failed. This can be used to provide a retry capability, since full rollback is not possible.
success | A FlowFile containing Avro records is routed to this relationship after the records have been successfully transmitted to Hive.
failure | A FlowFile containing Avro records is routed to this relationship if the records could not be transmitted to Hive.
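
To make the partial-success case concrete, the following Python sketch estimates how many records of a flow file are already committed when an error occurs partway through. It is a simplified model, not the processor's logic: the function name `committed_before_failure` is hypothetical, and it assumes that a commit happens exactly after every 'Records per Transaction' records and that the failing transaction commits nothing.

```python
def committed_before_failure(total_records, records_per_txn, failed_at):
    """Split a flow file's records into committed and uncommitted counts.

    failed_at is the zero-based index of the record on which the error
    occurred (an assumption of this model).
    """
    if not 0 <= failed_at < total_records:
        raise ValueError("failed_at must be a valid record index")
    # Only transactions that completed before the failing record count.
    committed = (failed_at // records_per_txn) * records_per_txn
    return committed, total_records - committed


# 25000 records, 10000 per transaction, error on record index 12345:
# the first transaction is committed, the remainder is not.
committed, remaining = committed_before_failure(25000, 10000, 12345)
```

Under this model, reprocessing the whole original flow file after such a failure would write the already committed records a second time, which is the duplication risk described for 'Rollback On Failure'.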

Reads Attributes:

None specified.

Writes Attributes:

Name | Description
hivestreaming.record.count | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the number of records from the incoming flow file written successfully and unsuccessfully, respectively.
query.output.tables | This attribute is written on the flow files routed to the 'success' and 'failure' relationships, and contains the target table name in 'databaseName.tableName' format.

State management:

This component does not store state.

Restricted:

This component is not restricted.

System Resource Considerations:

None specified.