PutDynamoDBRecord 2.0.0

Bundle
org.apache.nifi | nifi-aws-nar
Description
Inserts items into DynamoDB based on record-oriented data. The record fields are mapped into DynamoDB item fields, including partition and sort keys if set. Depending on the number of records, the processor might execute the insert in multiple chunks in order to overcome DynamoDB's limitation on batch writing. This might result in partially processed FlowFiles, in which case the FlowFile will be transferred to the "unprocessed" relationship with the necessary attribute to retry later without duplicating the already executed inserts.
Tags
AWS, Amazon, DynamoDB, Insert, Put, Record
Input Requirement
REQUIRED
Supports Sensitive Dynamic Properties
false
  • Additional Details for PutDynamoDBRecord 2.0.0

    PutDynamoDBRecord

    Description

    PutDynamoDBRecord provides the capability to insert multiple Items into a DynamoDB table from a record-oriented FlowFile. Compared to PutDynamoDB, this processor can process data in formats other than JSON and can add multiple fields to a given Item. PutDynamoDBRecord is also designed to insert bigger batches of data into the database.

    Data types

    The list of data types supported by DynamoDB does not fully overlap with the capabilities of the Record data structure. Some conversions and simplifications are necessary when inserting the data. These are:

    • Numeric values are stored using a floating-point data structure within Items. In some cases this representation might cause issues with the accuracy.
    • Char is not a supported type within DynamoDB; these fields are converted into String values.
    • Enum types are stored as String fields, using the name of the given enum.
    • DynamoDB stores time and date related information as Strings.
    • Internal record structures are converted into maps.
    • Choice is not a supported data type. Regardless of the actual wrapped data type, values enveloped in Choice are handled as Strings.
    • Unknown data types are handled as Strings.
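
    The conversion rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the processor's actual implementation: the helper name to_item_value, the use of Python types as stand-ins for Record field types, and the ISO string format for dates are all assumptions made for the example.

```python
import datetime
import enum


def to_item_value(value):
    """Hypothetical helper: convert one record value following the rules above."""
    if isinstance(value, enum.Enum):
        return value.name                 # enum -> its name, stored as a String
    if isinstance(value, bool):
        return value                      # not covered by the list; passed through (assumption)
    if isinstance(value, (int, float)):
        return float(value)               # numeric -> floating point, accuracy may suffer
    if isinstance(value, (datetime.date, datetime.time)):
        return value.isoformat()          # date/time -> String (ISO format is an assumption)
    if isinstance(value, dict):
        # nested record -> map, converting the inner values recursively
        return {key: to_item_value(inner) for key, inner in value.items()}
    return str(value)                     # char, choice and unknown types -> String


# The floating-point representation cannot tell these two integers apart.
lossy = to_item_value(2**53 + 1) == to_item_value(2**53)
```

    The last line illustrates the accuracy issue mentioned in the first bullet: two different large integers end up as the same stored value.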

    Limitations

    Batch inserting into DynamoDB comes with two inherent limitations. First, the number of Items inserted in a single batch request is limited to 25. In order to overcome this, during one execution, depending on the number of records in the incoming FlowFile, PutDynamoDBRecord might attempt multiple insert calls towards the database server. With this approach, the flow does not have to deal with this limitation in most cases.

    Performing multiple external actions carries the risk of an unforeseen result at one of the steps. For example, when the incoming FlowFile consists of 70 records, it will be split into 3 chunks, with a single insert operation for every chunk. The first two chunks contain 25 Items each, and the third contains the remaining 20. In some cases the first two insert operations might succeed while the third one fails. In these cases we consider the FlowFile “partially processed” and transfer it to the “failure” or “unprocessed” Relationship according to the nature of the issue. In order to keep the information about the successfully processed chunks, the processor assigns the “dynamodb.chunks.processed” attribute to the FlowFile, with the number of successfully processed chunks as its value.
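
    The chunking arithmetic from the 70-record example can be reproduced with a few lines of Python. This is only a sketch of the idea; the helper name split_into_chunks is hypothetical and does not correspond to anything in the processor.

```python
MAX_BATCH_SIZE = 25  # per-request Item limit described above


def split_into_chunks(records, chunk_size=MAX_BATCH_SIZE):
    """Hypothetical helper: split records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


records = list(range(70))                    # stand-in for 70 incoming records
chunks = split_into_chunks(records)
chunk_sizes = [len(chunk) for chunk in chunks]  # [25, 25, 20]
```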

    The most common reason for this behaviour comes from the other limitation of DynamoDB inserts: the database has built-in supervision over the amount of inserted data. When a client reaches the “throughput limit”, the server refuses to process insert requests for a certain amount of time. More information is available in the DynamoDB documentation. From the perspective of PutDynamoDBRecord, we consider these cases temporary issues: the FlowFile will be transferred to the “unprocessed” Relationship, after which the processor will yield in order to avoid further throughput issues. (Other kinds of failures will result in a transfer to the “failure” Relationship.)

    Retry

    It is suggested to loop the “unprocessed” Relationship back to PutDynamoDBRecord in some way. FlowFiles transferred to that relationship are considered healthy and might be successfully processed at a later point. It is possible that the FlowFile contains such a high number of records that more than two attempts are needed to fully insert them. The attribute “dynamodb.chunks.processed” is “rolled” through the attempts: after each trigger it contains the total number of inserted chunks, making it possible for later attempts to continue from the right point without duplicated inserts.
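
    The way the “dynamodb.chunks.processed” attribute carries progress between attempts can be illustrated with a small simulation. Everything here is a hypothetical sketch: the function process, the exception ThroughputExceeded and the callable insert_chunk are stand-ins invented for the example, and writing the attribute on success as well is an assumption.

```python
CHUNK_SIZE = 25
ATTR = "dynamodb.chunks.processed"


class ThroughputExceeded(Exception):
    """Stand-in for a temporary throughput error (hypothetical name)."""


def process(records, attributes, insert_chunk):
    """Hypothetical sketch of one trigger; returns (relationship, new attributes)."""
    chunks = [records[i:i + CHUNK_SIZE] for i in range(0, len(records), CHUNK_SIZE)]
    done = int(attributes.get(ATTR, 0))      # if not set, it is considered as 0
    relationship = "success"
    try:
        for chunk in chunks[done:]:          # skip chunks inserted by earlier attempts
            insert_chunk(chunk)
            done += 1
    except ThroughputExceeded:
        relationship = "unprocessed"         # temporary issue, worth retrying
    except Exception:
        relationship = "failure"             # any other kind of failure
    updated = dict(attributes)
    updated[ATTR] = str(done)                # running total across attempts
    return relationship, updated


inserted = []
calls = {"count": 0}


def flaky_insert(chunk):
    """Fake inserter whose third call is throttled."""
    calls["count"] += 1
    if calls["count"] == 3:
        raise ThroughputExceeded()
    inserted.extend(chunk)


records = list(range(70))
rel1, attrs1 = process(records, {}, flaky_insert)      # stops after 2 chunks
rel2, attrs2 = process(records, attrs1, flaky_insert)  # resumes at the third chunk
```

    After the second call every record has been inserted exactly once, because the retry starts from the chunk index stored in the attribute rather than from the beginning.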

    Partition and sort keys

    The processor supports multiple strategies for assigning partition key and sort key to the inserted Items. These are:

    Partition Key Strategies

    Partition By Field

    The processor assigns one of the record fields as partition key. The name of the record field is specified by the “Partition Key Field” property and the value will be the value of the record field with the same name.

    Partition By Attribute

    The processor assigns the value of a FlowFile attribute as partition key. With this strategy all the Items within a FlowFile will share the same partition key value, so it is suggested for tables that also have a sort key, in order to meet the primary key requirements of DynamoDB. The property “Partition Key Field” defines the name of the Item field and the property “Partition Key Attribute” specifies which attribute’s value will be assigned to the partition key. With this strategy the “Partition Key Field” must be different from the fields contained in the incoming records.

    Generated UUID

    With this strategy the processor will generate a UUID for every single Item. This identifier will be used as the value of the partition key. The name of the field used as partition key is defined by the property “Partition Key Field”. With this strategy the “Partition Key Field” must be different from the fields contained in the incoming records. When using this strategy, the partition key in the DynamoDB table must have the String data type.
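
    As a rough illustration of the three partition key strategies, the following Python sketch builds an Item from a record. It is not the processor's code: the helper apply_partition_key and its parameters are hypothetical, and raising an error on a field name clash is this sketch's choice for showing the requirement.

```python
import uuid


def apply_partition_key(record, strategy, key_field, attributes=None, key_attribute=None):
    """Hypothetical helper: return a copy of the record with its partition key set."""
    item = dict(record)
    if strategy == "Partition By Field":
        if key_field not in record:
            raise ValueError(f"record has no field {key_field!r}")
        return item                                   # the existing field doubles as the key
    if key_field in record:
        raise ValueError(f"{key_field!r} must differ from the record's own fields")
    if strategy == "Partition By Attribute":
        item[key_field] = attributes[key_attribute]   # same value for every Item of the FlowFile
    elif strategy == "Generated UUID":
        item[key_field] = str(uuid.uuid4())           # String key, different for every Item
    else:
        raise ValueError(f"unknown strategy {strategy!r}")
    return item


record = {"type": "A", "subtype": 4, "class": "t", "size": 1}
by_field = apply_partition_key(record, "Partition By Field", "class")
by_attribute = apply_partition_key(
    record, "Partition By Attribute", "source",
    attributes={"filename": "data46362.json"}, key_attribute="filename",
)
by_uuid = apply_partition_key(record, "Generated UUID", "identifier")
```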

    Sort Key Strategies

    None

    No sort key will be assigned to the Item. If the table definition expects a sort key, using this strategy will result in unsuccessful inserts.

    Sort By Field

    The processor assigns one of the record fields as sort key. The name of the record field is specified by the “Sort Key Field” property and the value will be the value of the record field with the same name. With this strategy the “Sort Key Field” must name a field that exists in the incoming records.

    Generate Sequence

    The processor assigns a generated value to every Item based on the original record’s position in the incoming FlowFile (regardless of the chunks). The first Item will have the sort key 1, the second will have sort key 2, and so on. The generated keys are unique within a given FlowFile. The name of the record field is specified by the “Sort Key Field” property. With this strategy the “Sort Key Field” must be different from the fields contained in the incoming records. When using this strategy, the sort key in the DynamoDB table must have the Number data type.
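
    The point that generated sequence values follow the record's position in the FlowFile, not its position in a chunk, can be sketched as below. The helper apply_sort_keys is hypothetical and only illustrates the numbering; it is not taken from the processor.

```python
def apply_sort_keys(records, strategy, sort_field=None):
    """Hypothetical helper: return copies of the records with sort keys assigned."""
    items = []
    for position, record in enumerate(records, start=1):
        item = dict(record)
        if strategy == "Generate Sequence":
            if sort_field in record:
                raise ValueError(f"{sort_field!r} must differ from the record's own fields")
            item[sort_field] = position   # 1-based position within the whole FlowFile
        elif strategy not in ("None", "Sort By Field"):
            raise ValueError(f"unknown strategy {strategy!r}")
        # "None" adds nothing; "Sort By Field" relies on a field already in the record
        items.append(item)
    return items


records = [{"type": "A"} for _ in range(70)]
items = apply_sort_keys(records, "Generate Sequence", "sort")
chunks = [items[i:i + 25] for i in range(0, len(items), 25)]
first_key_of_second_chunk = chunks[1][0]["sort"]  # 26, numbering does not restart per chunk
```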

    Examples

    Using fields as partition and sort key

    Setup

    • Partition Key Strategy: Partition By Field
    • Partition Key Field: class
    • Sort Key Strategy: Sort By Field
    • Sort Key Field: size

    Note: both fields have to exist in the incoming records!

    Result

    Using this pair of strategies will result in Items identical to the incoming records (not counting the representational changes from the conversion). The fields specified by the properties are added to the Items normally, the only difference being that they are flagged as (primary) key fields.

    Input

    [
      {
        "type": "A",
        "subtype": 4,
        "class": "t",
        "size": 1
      }
    ]
    

    Output (stylized)

    • type: String field with value “A”
    • subtype: Number field with value 4
    • class: String field with value “t” and serving as partition key
    • size: Number field with value 1 and serving as sort key

    Using FlowFile filename as partition key with generated sort key

    Setup

    • Partition Key Strategy: Partition By Attribute
    • Partition Key Field: source
    • Partition Key Attribute: filename
    • Sort Key Strategy: Generate Sequence
    • Sort Key Field: sort

    Result

    The FlowFile’s filename attribute will be used as partition key. In this case all the records within the same FlowFile will share the same partition key. In order to avoid collisions when FlowFiles contain multiple records, using a sort key is suggested. In this case a generated sequence is used, which is guaranteed to be unique within a given FlowFile.

    Input

    [
      {
        "type": "A",
        "subtype": 4,
        "class": "t",
        "size": 1
      },
      {
        "type": "B",
        "subtype": 5,
        "class": "m",
        "size": 2
      }
    ]
    

    Output (stylized)

    First Item
    • source: String field with value “data46362.json” and serving as partition key
    • type: String field with value “A”
    • subtype: Number field with value 4
    • class: String field with value “t”
    • size: Number field with value 1
    • sort: Number field with value 1 and serving as sort key
    Second Item
    • source: String field with value “data46362.json” and serving as partition key
    • type: String field with value “B”
    • subtype: Number field with value 5
    • class: String field with value “m”
    • size: Number field with value 2
    • sort: Number field with value 2 and serving as sort key

    Using generated partition key

    Setup

    • Partition Key Strategy: Generated UUID
    • Partition Key Field: identifier
    • Sort Key Strategy: None

    Result

    A generated UUID will be used as partition key. A different UUID will be generated for every Item.

    Input

    [
      {
        "type": "A",
        "subtype": 4,
        "class": "t",
        "size": 1
      }
    ]
    

    Output (stylized)

    • identifier: String field with value “872ab776-ed73-4d37-a04a-807f0297e06e” and serving as partition key
    • type: String field with value “A”
    • subtype: Number field with value 4
    • class: String field with value “t”
    • size: Number field with value 1
Properties
System Resource Considerations
Resource Description
MEMORY An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.
NETWORK An instance of this component can cause high usage of this system resource. Multiple instances or high concurrency settings may result in a degradation of performance.
Relationships
Name Description
unprocessed FlowFiles are routed to the unprocessed relationship when DynamoDB is not able to process all the items in the request. Typical reasons are insufficient table throughput capacity and exceeding the maximum bytes per request. Unprocessed FlowFiles can be retried with a new request.
success FlowFiles are routed to the success relationship
failure FlowFiles are routed to the failure relationship
Reads Attributes
Name Description
dynamodb.chunks.processed Number of chunks successfully inserted into DynamoDB. If not set, it is considered as 0
Writes Attributes
Name Description
dynamodb.chunks.processed Number of chunks successfully inserted into DynamoDB. If not set, it is considered as 0
dynamodb.key.error.unprocessed DynamoDB unprocessed keys
dynmodb.range.key.value.error DynamoDB range key error
dynamodb.key.error.not.found DynamoDB key not found
dynamodb.error.exception.message DynamoDB exception message
dynamodb.error.code DynamoDB error code
dynamodb.error.message DynamoDB error message
dynamodb.error.service DynamoDB error service
dynamodb.error.retryable DynamoDB error is retryable
dynamodb.error.request.id DynamoDB error request id
dynamodb.error.status.code DynamoDB error status code
dynamodb.item.io.error IO exception message on creating item
See Also