SelectHive3QL

Description:

Execute provided HiveQL SELECT query against a Hive database connection. Query result will be converted to Avro or CSV format. Streaming is used so arbitrarily large result sets are supported. This processor can be scheduled to run on a timer, or cron expression, using the standard scheduling methods, or it can be triggered by an incoming FlowFile. If it is triggered by an incoming FlowFile, then attributes of that FlowFile will be available when evaluating the select query. FlowFile attribute 'selecthiveql.row.count' indicates how many rows were selected.

Tags:

hive, sql, select, jdbc, query, database

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display NameAPI NameDefault ValueAllowable ValuesDescription
Hive Database Connection Pooling Servicehive3-dbcp-serviceController Service API:
Hive3DBCPService
Implementation: Hive3ConnectionPool
The Hive Controller Service that is used to obtain connection(s) to the Hive database
HiveQL Pre-Queryhive-pre-queryHiveQL pre-query to execute. Semicolon-delimited list of queries. Example: 'set tez.queue.name=queue1; set hive.exec.orc.split.strategy=ETL; set hive.exec.reducers.bytes.per.reducer=1073741824'. Note, the results/outputs of these queries will be suppressed if successfully executed.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
HiveQL Select Queryhive-queryHiveQL SELECT query to execute. If this is not set, the query is assumed to be in the content of an incoming FlowFile.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
HiveQL Post-Queryhive-post-queryHiveQL post-query to execute. Semicolon-delimited list of queries. Note, the results/outputs of these queries will be suppressed if successfully executed.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Fetch Sizehive-fetch-size0The number of result rows to be fetched from the result set at a time. This is a hint to the driver and may not be honored and/or exact. If the value specified is zero, then the hint is ignored.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Query timeouthive3-query-timeout0Sets the number of seconds the driver will wait for a query to execute. A value of 0 means no timeout. NOTE: Non-zero values may not be supported by the driver.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Max Rows Per Flow Filehive-max-rows0The maximum number of result rows that will be included in a single FlowFile. This will allow you to break up very large result sets into multiple FlowFiles. If the value specified is zero, then all rows are returned in a single FlowFile.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Maximum Number of Fragmentshive-max-frags0The maximum number of fragments. If the value specified is zero, then all fragments are returned. This prevents OutOfMemoryError when this processor ingests huge table.
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
Output Formathive-output-formatAvro
  • Avro
  • CSV
How to represent the records coming from Hive (Avro, CSV, e.g.)
Normalize Table/Column Nameshive-normalize-avrofalse
  • true
  • false
Whether to change non-Avro-compatible characters in column names to Avro-compatible characters. For example, colons and periods will be changed to underscores in order to build a valid Avro record.
Use Avro Logical Typesuse-logical-typesfalse
  • true
  • false
Whether to use Avro Logical Types for DECIMAL, DATE and TIMESTAMP columns. If disabled, written as string. If enabled, Logical types are used and written as its underlying type, specifically, DECIMAL as logical 'decimal': written as bytes with additional precision and scale meta data, DATE as logical 'date': written as int denoting days since Unix epoch (1970-01-01), and TIMESTAMP as logical 'timestamp-millis': written as long denoting milliseconds since Unix epoch. If a reader of written Avro records also knows these logical types, then these values can be deserialized with more context depending on reader implementation.
CSV Headercsv-headertrue
  • true
  • false
Include Header in Output
Alternate CSV Headercsv-alt-headerComma separated list of header fields
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
CSV Delimitercsv-delimiter,CSV Delimiter used to separate fields
Supports Expression Language: true (will be evaluated using flow file attributes and Environment variables)
CSV Quotecsv-quotetrue
  • true
  • false
Whether to force quoting of CSV fields. Note that this might conflict with the setting for CSV Escape.
CSV Escapecsv-escapetrue
  • true
  • false
Whether to escape CSV strings in output. Note that this might conflict with the setting for CSV Quote.
Character Sethive3-charsetUTF-8Specifies the character set of the record data.

Relationships:

NameDescription
successSuccessfully created FlowFile from HiveQL query result set.
failureHiveQL query execution failed. Incoming FlowFile will be penalized and routed to this relationship.

Reads Attributes:

None specified.

Writes Attributes:

NameDescription
mime.typeSets the MIME type for the outgoing flowfile to application/avro-binary for Avro or text/csv for CSV.
filenameAdds .avro or .csv to the filename attribute depending on which output format is selected.
selecthiveql.row.countIndicates how many rows were selected/returned by the query.
selecthiveql.query.durationCombined duration of the query execution time and fetch time in milliseconds. If 'Max Rows Per Flow File' is set, then this number will reflect only the fetch time for the rows in the Flow File instead of the entire result set.
selecthiveql.query.executiontimeDuration of the query execution time in milliseconds. This number will reflect the query execution time regardless of the 'Max Rows Per Flow File' setting.
selecthiveql.query.fetchtimeDuration of the result set fetch time in milliseconds. If 'Max Rows Per Flow File' is set, then this number will reflect only the fetch time for the rows in the Flow File instead of the entire result set.
fragment.identifierIf 'Max Rows Per Flow File' is set then all FlowFiles from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.
fragment.countIf 'Max Rows Per Flow File' is set then this is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet.
fragment.indexIf 'Max Rows Per Flow File' is set then the position of this FlowFile in the list of outgoing FlowFiles that were all derived from the same result set FlowFile. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same query result set and in what order FlowFiles were produced
query.input.tablesContains input table names in comma delimited 'databaseName.tableName' format.

State management:

This component does not store state.

Restricted:

This component is not restricted.

Input requirement:

This component allows an incoming relationship.

System Resource Considerations:

None specified.