Description

PutDynamoDBRecord intends to provide the capability to insert multiple Items into a DynamoDB table from a record-oriented FlowFile. Compared to the PutDynamoDB, this processor is capable to process data based other than JSON format too and prepared to add multiple fields for a given Item. Also, PutDynamoDBRecord is designed to insert bigger batches of data into the database.

Data types

The list data types supported by DynamoDB does not fully overlaps with the capabilities of the Record data structure. Some conversions and simplifications are necessary during inserting the data. These are:

Limitations

Working with DynamoDB when batch inserting comes with two inherit limitations. First, the number of inserted Items is limited to 25 in any case. In order to overcome this, during one execution, depending on the number or records in the incoming FlowFile, PutDynamoDBRecord might attempt multiple insert calls towards the database server. Using this approach, the flow does not have to work with this limitation in most cases.

Having multiple external actions comes with the risk of having an unforeseen result at one of the steps. For example when the incoming FlowFile is consists of 70 records, it will be split into 3 chunks, with a single insert operation for every chunk. The first two chunks contains 25 Items to insert per chunk, and the third contains the remaining 20. In some cases it might occur that the first two insert operation succeeds but the third one fails. In these cases we consider the FlowFile "partially processed" and we will transfer it to the "failure" or "unprocessed" Relationship according to the nature of the issue. In order to keep the information about the successfully processed chunks the processor assigns the "dynamodb.chunks.processed" attribute to the FlowFile, which has the number of successfully processed chunks as value.

The most common reason for this behaviour comes from the other limitation the inserts have with DynamoDB: the database has a build in supervision over the amount of inserted data. When a client reaches the "throughput limit", the server refuses to process the insert request until a certain amount of time. More information on this might be find here. From the perspective of the PutDynamoDBRecord we consider these cases as temporary issues and the FlowFile will be transferred to the "unprocessed" Relationship after which the processor will yield in order to avoid further throughput issues. (Other kinds of failures will result transfer to the "failure" Relationship)

Retry

It is suggested to loop back the "unprocessed" Relationship to the PutDynamoDBRecord in some way. FlowFiles transferred to that relationship considered as healthy ones might be successfully processed in a later point. It is possible that the FlowFile contains such a high number of records, what needs more than two attempts to fully insert. The attribute "dynamodb.chunks.processed" is "rolled" through the attempts, which means, after each trigger it will contain the sum number of inserted chunks making it possible for the later attempts to continue from the right point without duplicated inserts.

Partition and sort keys

The processor supports multiple strategies for assigning partition key and sort key to the inserted Items. These are:

Partition Key Strategies

Partition By Field

The processors assigns one of the record fields as partition key. The name of the record field is specified by the "Partition Key Field" property and the value will be the value of the record field with the same name.

Partition By Attribute

The processor assigns the value of a FlowFile attribute as partition key. With this strategy all the Items within a FlowFile will share the same partition key value and it is suggested to use for tables also having a sort key in order to meet the primary key requirements of the DynamoDB. The property "Partition Key Field" defines the name of the Item field and the property "Partition Key Attribute" will specify which attribute's value will be assigned to the partition key. With this strategy the "Partition Key Field" must be different from the fields consisted by the incoming records.

Generated UUID

By using this strategy the processor will generate a UUID identifier for every single Item. This identifier will be used as value for the partition key. The name of the field used as partition key is defined by the property "Partition Key Field". With this strategy the "Partition Key Field" must be different from the fields consisted by the incoming records. When using this strategy, the partition key in the DynamoDB table must have String data type.

Sort Key Strategies

None

No sort key will be assigned to the Item. In case of the table definition expects it, using this strategy will result unsuccessful inserts.

Sort By Field

The processors assigns one of the record fields as sort key. The name of the record field is specified by the "Sort Key Field" property and the value will be the value of the record field with the same name. With this strategy the "Sort Key Field" must be different from the fields consisted by the incoming records.

Generate Sequence

The processor assigns a generated value to every Item based on the original record's position in the incoming FlowFile (regardless of the chunks). The first Item will have the sort key 1, the second will have sort key 2 and so on. The generated keys are unique within a given FlowFile. The name of the record field is specified by the "Sort Key Field" attribute. With this strategy the "Sort Key Field" must be different from the fields consisted by the incoming records. When using this strategy, the sort key in the DynamoDB table must have Number data type.

Examples

Using fields as partition and sort key

Setup

Note: both fields have to exist in the incoming records!

Result

Using this pair of strategies will result Items identical to the incoming record (not counting the representational changes from the conversion). The field specified by the properties are added to the Items normally with the only difference of flagged as (primary) key items.

Input

[{"type": "A", "subtype": 4, "class" : "t", "size": 1}]

Output (stylized)

Using FlowFile filename as partition key with generated sort key

Setup

Result

The FlowFile's filename attribute will be used as partition key. In this case all the records within the same FlowFile will share the same partition key. In order to avoid collusion, if FlowFiles contain multiple records, using sort key is suggested. In this case a generated sequence is used which is guaranteed to be unique within a given FlowFile.

Input

[ {"type": "A", "subtype": 4, "class" : "t", "size": 1}, {"type": "B", "subtype": 5, "class" : "m", "size": 2} ]

Output (stylized)

First Item
Second Item

Using generated partition key

Setup

Result

A generated UUID will be used as partition key. A different UUID will be generated for every Item.

Input

[{"type": "A", "subtype": 4, "class" : "t", "size": 1}]

Output (stylized)