GenerateTableFetch 2.0.0

Bundle
org.apache.nifi | nifi-standard-nar
Description
Generates SQL select queries that fetch "pages" of rows from a table. The partition size property, along with the table's row count, determine the size and number of pages and generated FlowFiles. In addition, incremental fetching can be achieved by setting Maximum-Value Columns, which causes the processor to track the columns' maximum values, thus only fetching rows whose columns' values exceed the observed maximums. This processor is intended to be run on the Primary Node only. This processor can accept incoming connections; the behavior of the processor is different whether incoming connections are provided: - If no incoming connection(s) are specified, the processor will generate SQL queries on the specified processor schedule. Expression Language is supported for many fields, but no FlowFile attributes are available. However the properties will be evaluated using the Environment/System properties. - If incoming connection(s) are specified and no FlowFile is available to a processor task, no work will be performed. - If incoming connection(s) are specified and a FlowFile is available to a processor task, the FlowFile's attributes may be used in Expression Language for such fields as Table Name and others. However, the Max-Value Columns and Columns to Return fields must be empty or refer to columns that are available in each specified table.
Tags
database, fetch, generate, jdbc, query, select, sql
Input Requirement
ALLOWED
Supports Sensitive Dynamic Properties
false
  • Additional Details for GenerateTableFetch 2.0.0

    GenerateTableFetch

    GenerateTableFetch uses its properties and the specified database connection to generate FlowFiles containing SQL statements that can be used to fetch “pages” (aka “partitions”) of data from a table. GenerateTableFetch executes a query to the database to determine the current row count and maximum value, and if Maximum Value Columns are specified, will collect the count of rows whose values for the Maximum Value Columns are larger than those last observed by GenerateTableFetch. This allows for incremental fetching of “new” rows, rather than generating SQL to fetch the entire table each time. If no Maximum Value Columns are set, then the processor will generate SQL to fetch the entire table each time.

    In order to generate SQL that will fetch pages/partitions of data, by default GenerateTableFetch will generate SQL that orders the data based on the Maximum Value Columns (if present) and utilize the row numbers of the result set to determine each page. For example if the Maximum Value Column is an integer “id” and the partition size is 10, then the SQL for the first page might be “SELECT * FROM myTable LIMIT 10” and the second page might be “SELECT * FROM myTable OFFSET 10 LIMIT 10”, and so on.

    Ordering the data can be an expensive operation depending on the database, the number of rows, etc. Alternatively, it is possible to specify a column whose values will be used to determine the pages, using the Column for Value Partitioning property. If set, GenerateTableFetch will determine the minimum and maximum values for the column, and uses the minimum value as the initial offset. The SQL to fetch a page is then based on this initial offset and the total difference in values (i.e. maximum - minimum) divided by the page size. For example, if the column “id” is used for value partitioning, and the column contains values 100 to 200, then with a page size of 10 the SQL to fetch the first page might be “SELECT * FROM myTable WHERE id >= 100 AND id < 110” and the second page might be “SELECT * FROM myTable WHERE id >= 110 AND id < 120”, and so on.

    It is important that the Column for Value Partitioning be set to a column whose type can be coerced to a long integer ( i.e. not date or timestamp), and that the column values are evenly distributed and not sparse, for best performance. As a counterexample to the above, consider a column “id” whose values are 100, 2000, and 30000. If the Partition Size is 100, then the column values are relatively sparse, so the SQL for the “second page” (see above example) will return zero rows, and so will every page until the value in the query becomes “id >= 2000”. Another counterexample is when the values are not uniformly distributed. Consider a column “id” with values 100, 200, 201, 202, … 299. Then the SQL for the first page (see above example) will return one row with value id = 100, and the second page will return 100 rows with values 200 … 299. This can cause inconsistent processing times downstream, as the pages may contain a very different number of rows. For these reasons it is recommended to use a Column for Value Partitioning that is sufficiently dense (not sparse) and fairly evenly distributed.

Properties
Dynamic Properties
State Management
Scopes Description
CLUSTER After performing a query on the specified table, the maximum values for the specified column(s) will be retained for use in future executions of the query. This allows the Processor to fetch only those records that have max values greater than the retained values. This can be used for incremental fetching, fetching of newly added rows, etc. To clear the maximum values, clear the state of the processor per the State Management documentation
Relationships
Name Description
failure This relationship is only used when SQL query execution (using an incoming FlowFile) failed. The incoming FlowFile will be penalized and routed to this relationship. If no incoming connection(s) are specified, this relationship is unused.
success Successfully created FlowFile from SQL query result set.
Writes Attributes
Name Description
generatetablefetch.sql.error If the processor has incoming connections, and processing an incoming FlowFile causes a SQL Exception, the FlowFile is routed to failure and this attribute is set to the exception message.
generatetablefetch.tableName The name of the database table to be queried.
generatetablefetch.columnNames The comma-separated list of column names used in the query.
generatetablefetch.whereClause Where clause used in the query to get the expected rows.
generatetablefetch.maxColumnNames The comma-separated list of column names used to keep track of data that has been returned since the processor started running.
generatetablefetch.limit The number of result rows to be fetched by the SQL statement.
generatetablefetch.offset Offset to be used to retrieve the corresponding partition.
fragment.identifier All FlowFiles generated from the same query result set will have the same value for the fragment.identifier attribute. This can then be used to correlate the results.
fragment.count This is the total number of FlowFiles produced by a single ResultSet. This can be used in conjunction with the fragment.identifier attribute in order to know how many FlowFiles belonged to the same incoming ResultSet.
fragment.index This is the position of this FlowFile in the list of outgoing FlowFiles that were all generated from the same execution. This can be used in conjunction with the fragment.identifier attribute to know which FlowFiles originated from the same execution and in what order FlowFiles were produced
See Also