ListS3

ListS3 2.4.0

Bundle: org.apache.nifi | nifi-aws-nar
Description: Retrieves a listing of objects from an S3 bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchS3Object. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.
Tags: AWS, Amazon, S3, list
Input Requirement: FORBIDDEN
Supports Sensitive Dynamic Properties: false

Additional Details for ListS3 2.4.0
ListS3

Streaming Versus Batch Processing

ListS3 performs a listing of all S3 Objects that it encounters in the configured S3 bucket. There are two common, broadly defined use cases.

Streaming Use Case

By default, the Processor creates a separate FlowFile for each object in the bucket and adds attributes for filename, bucket, etc. A common use case is to connect ListS3 to the FetchS3Object processor. Used together, these two processors make it easy to monitor a bucket and fetch the contents of each new object as it lands in S3, in an efficient streaming fashion.

Batch Use Case

Another common use case is the desire to process all newly arriving objects in a given bucket, and to then perform some action only when all objects have completed their processing. The above approach of streaming the data makes this difficult, because NiFi is inherently a streaming platform in that there is no “job” that has a beginning and an end. Data is simply picked up as it becomes available.

To solve this, the ListS3 Processor can optionally be configured with a Record Writer. When a Record Writer is configured, a single FlowFile will be created that will contain a Record for each object in the bucket, instead of a separate FlowFile per object. See the documentation for ListFile for an example of how to build a dataflow that allows for processing all the objects before proceeding with any other step.

One important difference between the data produced by ListFile and ListS3, though, is the structure of the Records that are emitted: the Records emitted by ListFile have a different schema than those emitted by ListS3. ListS3 emits records with the following schema (in Avro format):
```
{
  "type": "record",
  "name": "nifiRecord",
  "namespace": "org.apache.nifi",
  "fields": [
    {
      "name": "key",
      "type": "string"
    },
    {
      "name": "bucket",
      "type": "string"
    },
    {
      "name": "owner",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "etag",
      "type": "string"
    },
    {
      "name": "lastModified",
      "type": {
        "type": "long",
        "logicalType": "timestamp-millis"
      }
    },
    {
      "name": "size",
      "type": "long"
    },
    {
      "name": "storageClass",
      "type": "string"
    },
    {
      "name": "latest",
      "type": "boolean"
    },
    {
      "name": "versionId",
      "type": [
        "null",
        "string"
      ]
    },
    {
      "name": "tags",
      "type": [
        "null",
        {
          "type": "map",
          "values": "string"
        }
      ]
    },
    {
      "name": "userMetadata",
      "type": [
        "null",
        {
          "type": "map",
          "values": "string"
        }
      ]
    }
  ]
}
```
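
As a rough illustration of consuming records with this schema, the sketch below parses a listing as it might look if the Record Writer produced a JSON array with `lastModified` written as epoch milliseconds. Both points are assumptions, since the actual serialization depends on the configured writer. `S3ListingRecord`, `parse_listing`, and the sample data are hypothetical names and values invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class S3ListingRecord:
    """Mirrors the fields of the Avro schema above (hypothetical helper class)."""
    key: str
    bucket: str
    owner: Optional[str]
    etag: str
    lastModified: int  # assumed to be milliseconds since epoch (timestamp-millis)
    size: int
    storageClass: str
    latest: bool
    versionId: Optional[str]
    tags: Optional[Dict[str, str]]
    userMetadata: Optional[Dict[str, str]]


def parse_listing(text: str) -> List[S3ListingRecord]:
    """Parse a JSON array of listing records; absent nullable fields become None."""
    nullable = ("owner", "versionId", "tags", "userMetadata")
    records = []
    for raw in json.loads(text):
        for name in nullable:
            raw.setdefault(name, None)
        records.append(S3ListingRecord(**raw))
    return records


# Made-up sample data, for illustration only.
sample = """[
  {"key": "logs/2024/a.log", "bucket": "example-bucket", "owner": null,
   "etag": "abc123", "lastModified": 1700000000000, "size": 2048,
   "storageClass": "STANDARD", "latest": true, "versionId": null,
   "tags": null, "userMetadata": null}
]"""

listing = parse_listing(sample)
total_bytes = sum(record.size for record in listing)
```

Because the whole listing arrives in one FlowFile, an aggregate such as `total_bytes` can be computed over every object in a single pass.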

Properties

AWS Credentials Provider Service
The Controller Service that is used to obtain the AWS credentials provider

Display Name

AWS Credentials Provider Service

Description

The Controller Service that is used to obtain the AWS credentials provider

API Name

AWS Credentials Provider service

Service Interface

org.apache.nifi.processors.aws.credentials.provider.service.AWSCredentialsProviderService

Service Implementations

org.apache.nifi.processors.aws.credentials.provider.service.AWSCredentialsProviderControllerService

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Bucket
The S3 Bucket to interact with

Display Name

Bucket

Description

The S3 Bucket to interact with

API Name

Bucket

Expression Language Scope

Environment variables and FlowFile Attributes

Sensitive

false

Required

true
Communications Timeout
The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

Display Name

Communications Timeout

Description

The amount of time to wait in order to establish a connection to AWS or receive data from AWS before timing out.

API Name

Communications Timeout

Default Value

30 secs

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Custom Signer Class Name
Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.
Display Name

Custom Signer Class Name

Description

Fully qualified class name of the custom signer class. The signer must implement com.amazonaws.auth.Signer interface.

API Name

custom-signer-class-name

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true

Dependencies
- Signer Override is set to any of [CustomSignerType]
Custom Signer Module Location
Comma-separated list of paths to files and/or directories which contain the custom signer's JAR file and its dependencies (if any).
Display Name

Custom Signer Module Location

Description

Comma-separated list of paths to files and/or directories which contain the custom signer's JAR file and its dependencies (if any).

API Name

custom-signer-module-location

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

false

Dependencies
- Signer Override is set to any of [CustomSignerType]
Delimiter
The string used to delimit directories within the bucket. Please consult the AWS documentation for the correct use of this field.

Display Name

Delimiter

Description

The string used to delimit directories within the bucket. Please consult the AWS documentation for the correct use of this field.

API Name

delimiter

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Endpoint Override URL
Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

Display Name

Endpoint Override URL

Description

Endpoint URL to use instead of the AWS default including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.

API Name

Endpoint Override URL

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

false
Entity Tracking Initial Listing Target
Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.
Display Name

Entity Tracking Initial Listing Target

Description

Specify how initial listing should be handled. Used by 'Tracking Entities' strategy.

API Name

et-initial-listing-target

Default Value

all

Allowable Values
- Tracking Time Window
- All Available
Expression Language Scope

Not Supported

Sensitive

false

Required

true

Dependencies
- Listing Strategy is set to any of [entities]
Entity Tracking State Cache
Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or after a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If entities are tracked per node, the optional '::{nodeId}' part is added to manage state separately. For example, a cluster-wide cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', and a per-node cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration changes. Used by the 'Tracking Entities' strategy.
Display Name

Entity Tracking State Cache

Description

Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or after a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If entities are tracked per node, the optional '::{nodeId}' part is added to manage state separately. For example, a cluster-wide cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', and a per-node cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key is deleted when the target listing configuration changes. Used by the 'Tracking Entities' strategy.

API Name

et-state-cache

Service Interface

org.apache.nifi.distributed.cache.client.DistributedMapCacheClient

Service Implementations

org.apache.nifi.hazelcast.services.cacheclient.HazelcastMapCacheClient

org.apache.nifi.distributed.cache.client.MapCacheClientService

org.apache.nifi.redis.service.RedisDistributedMapCacheClientService

org.apache.nifi.redis.service.SimpleRedisDistributedMapCacheClientService

Expression Language Scope

Not Supported

Sensitive

false

Required

true

Dependencies
- Listing Strategy is set to any of [entities]
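
The cache key format described for Entity Tracking State Cache can be sketched as follows. This is an illustration, not NiFi's implementation: `entity_cache_key`, `encode_cache_content`, and `decode_cache_content` are hypothetical helpers. The source says only that the cached content is a Gzipped JSON string, so the shape of the JSON payload used in the usage note is invented.

```python
import gzip
import json
from typing import Any, Optional


def entity_cache_key(processor_id: str, node_id: Optional[str] = None) -> str:
    """Build a key in the documented 'ListedEntities::{processorId}(::{nodeId})' form."""
    key = f"ListedEntities::{processor_id}"
    if node_id is not None:
        # Per-node tracking appends the optional node identifier.
        key += f"::{node_id}"
    return key


def encode_cache_content(payload: Any) -> bytes:
    """Serialize a payload to JSON and gzip it (payload structure is an assumption)."""
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def decode_cache_content(blob: bytes) -> Any:
    """Reverse of encode_cache_content."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

For example, `entity_cache_key("8dda2321-0164-1000-50fa-3042fe7d6a7b", "nifi-node3")` yields the per-node key quoted in the property description.
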
Entity Tracking Time Window
Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp in the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated', and a FlowFile is emitted, if any one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity's timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.
Display Name

Entity Tracking Time Window

Description

Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp in the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated', and a FlowFile is emitted, if any one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity's timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

API Name

et-time-window

Default Value

3 hours

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true

Dependencies
- Entity Tracking State Cache is set to any value specified
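
The 'new/updated' rules and the time-window eviction described for Entity Tracking Time Window can be expressed as a small sketch. This is a simplified model rather than the processor's actual code: the `Entity` class and the three function names are hypothetical, and timestamps are assumed to be epoch milliseconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Entity:
    identifier: str
    timestamp_ms: int
    size: int


def in_window(entity: Entity, now_ms: int, window_ms: int) -> bool:
    """True if the entity's timestamp falls inside the tracking time window."""
    return entity.timestamp_ms >= now_ms - window_ms


def is_new_or_updated(entity: Entity, cached: Dict[str, Entity]) -> bool:
    """Apply the three documented conditions against the already-listed entities."""
    previous = cached.get(entity.identifier)
    if previous is None:
        return True  # 1. not in the already-listed entities
    if entity.timestamp_ms > previous.timestamp_ms:
        return True  # 2. newer timestamp than the cached entity
    return entity.size != previous.size  # 3. different size than the cached entity


def evict_expired(cached: Dict[str, Entity], now_ms: int, window_ms: int) -> Dict[str, Entity]:
    """Drop cached entities whose timestamp is older than the time window."""
    return {k: e for k, e in cached.items() if in_window(e, now_ms, window_ms)}
```

A longer window catches late-arriving or rewritten objects at the cost of a larger cached entity set.
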
List Type
Specifies whether to use the original List Objects or the newer List Objects Version 2 endpoint.
Display Name

List Type

Description

Specifies whether to use the original List Objects or the newer List Objects Version 2 endpoint.

API Name

list-type

Default Value

1

Allowable Values
- List Objects V1
- List Objects V2
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Listing Batch Size
If not using a Record Writer, this property dictates how many S3 objects should be listed in a single batch. Once this number is reached, the FlowFiles that have been created will be transferred out of the Processor. Setting this value lower may result in lower latency by sending out the FlowFiles before the complete listing has finished. However, it can significantly reduce performance. Larger values may take more memory to store all of the information before sending the FlowFiles out. This property is ignored if using a Record Writer, as one of the main benefits of the Record Writer is being able to emit the entire listing as a single FlowFile.

Display Name

Listing Batch Size

Description

If not using a Record Writer, this property dictates how many S3 objects should be listed in a single batch. Once this number is reached, the FlowFiles that have been created will be transferred out of the Processor. Setting this value lower may result in lower latency by sending out the FlowFiles before the complete listing has finished. However, it can significantly reduce performance. Larger values may take more memory to store all of the information before sending the FlowFiles out. This property is ignored if using a Record Writer, as one of the main benefits of the Record Writer is being able to emit the entire listing as a single FlowFile.

API Name

Listing Batch Size

Default Value

100

Expression Language Scope

Not Supported

Sensitive

false

Required

false
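
The latency and memory trade-off described for Listing Batch Size comes from grouping listed objects before they are transferred. The following is a minimal sketch of that grouping idea. `batched` is a hypothetical helper, not part of NiFi.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items, emitting each batch as soon as it fills."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

With a small `batch_size`, the first batch is available after only a few objects have been listed. With a large one, more entries are held in memory before anything is emitted.
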
Listing Strategy
Specify how to determine new/updated entities. See each strategy's description for details.
Display Name

Listing Strategy

Description

Specify how to determine new/updated entities. See each strategy's description for details.

API Name

listing-strategy

Default Value

timestamps

Allowable Values
- Tracking Timestamps
- Tracking Entities
- No Tracking
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Maximum Object Age
The maximum age that an S3 object can be in order to be considered; any object older than this amount of time (according to last modification date) will be ignored

Display Name

Maximum Object Age

Description

The maximum age that an S3 object can be in order to be considered; any object older than this amount of time (according to last modification date) will be ignored

API Name

max-age

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Minimum Object Age
The minimum age that an S3 object must be in order to be considered; any object younger than this amount of time (according to last modification date) will be ignored

Display Name

Minimum Object Age

Description

The minimum age that an S3 object must be in order to be considered; any object younger than this amount of time (according to last modification date) will be ignored

API Name

min-age

Default Value

0 sec

Expression Language Scope

Not Supported

Sensitive

false

Required

true
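
Taken together, Minimum Object Age and Maximum Object Age define an age range based on the last modification date. The sketch below shows one plausible reading of that filter. `within_age_bounds` is a hypothetical name, and treating an object whose age exactly equals a bound as accepted is an assumption drawn from the wording "younger than" and "older than".

```python
from typing import Optional


def within_age_bounds(
    last_modified_ms: int,
    now_ms: int,
    min_age_ms: int = 0,
    max_age_ms: Optional[int] = None,
) -> bool:
    """Return True if the object's age lies within [min_age_ms, max_age_ms]."""
    age_ms = now_ms - last_modified_ms
    if age_ms < min_age_ms:
        return False  # younger than the minimum age: ignored
    if max_age_ms is not None and age_ms > max_age_ms:
        return False  # older than the maximum age: ignored
    return True
```

A non-zero minimum age is a common way to skip objects that may still be changing.
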
Prefix
The prefix used to filter the object list. Do not begin with a forward slash '/'. In most cases, it should end with a forward slash '/'.

Display Name

Prefix

Description

The prefix used to filter the object list. Do not begin with a forward slash '/'. In most cases, it should end with a forward slash '/'.

API Name

prefix

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

false
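
To show how the Prefix and Delimiter properties interact, the sketch below simulates the general behavior of an S3 listing request in memory: keys are filtered by prefix, and with a delimiter set, keys that contain the delimiter after the prefix are rolled up into common prefixes instead of being returned as objects. This models the S3 listing API in simplified form and is not the processor's code. `list_with_delimiter` and the sample keys are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def list_with_delimiter(
    keys: Iterable[str], prefix: str = "", delimiter: Optional[str] = None
) -> Tuple[List[str], List[str]]:
    """Return (object_keys, common_prefixes) for an in-memory set of keys."""
    objects: List[str] = []
    common_prefixes: List[str] = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        if delimiter:
            rest = key[len(prefix):]
            idx = rest.find(delimiter)
            if idx >= 0:
                # Keys nested below the delimiter are grouped, not listed.
                rolled_up = prefix + rest[: idx + len(delimiter)]
                if rolled_up not in common_prefixes:
                    common_prefixes.append(rolled_up)
                continue
        objects.append(key)
    return objects, common_prefixes


sample_keys = ["data/a.csv", "data/sub/b.csv", "other/c.csv"]
```

With prefix `data/` and delimiter `/`, only `data/a.csv` is returned as an object. Without a delimiter, `data/sub/b.csv` is returned as well.
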
Proxy Configuration Service
Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

Display Name

Proxy Configuration Service

Description

Specifies the Proxy Configuration Controller Service to proxy network requests. Supported proxies: HTTP + AuthN

API Name

proxy-configuration-service

Service Interface

org.apache.nifi.proxy.ProxyConfigurationService

Service Implementations

org.apache.nifi.proxy.StandardProxyConfigurationService

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Record Writer
Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

Display Name

Record Writer

Description

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

API Name

record-writer

Service Interface

org.apache.nifi.serialization.RecordSetWriterFactory

Service Implementations

org.apache.nifi.avro.AvroRecordSetWriter

org.apache.nifi.csv.CSVRecordSetWriter

org.apache.nifi.text.FreeFormTextRecordSetWriter

org.apache.nifi.json.JsonRecordSetWriter

org.apache.nifi.lookup.RecordSetWriterLookup

org.apache.nifi.record.script.ScriptedRecordSetWriter

org.apache.nifi.xml.XMLRecordSetWriter

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Region
The AWS Region to connect to.
Display Name

Region

Description

The AWS Region to connect to.

API Name

Region

Default Value

us-west-2

Allowable Values
- AWS GovCloud (US)
- AWS GovCloud (US-East)
- US East (N. Virginia)
- US East (Ohio)
- US West (N. California)
- US West (Oregon)
- EU (Ireland)
- EU (London)
- EU (Paris)
- EU (Frankfurt)
- EU (Zurich)
- EU (Stockholm)
- EU (Milan)
- EU (Spain)
- Asia Pacific (Hong Kong)
- Asia Pacific (Mumbai)
- Asia Pacific (Hyderabad)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Jakarta)
- Asia Pacific (Melbourne)
- Asia Pacific (Tokyo)
- Asia Pacific (Seoul)
- Asia Pacific (Osaka)
- South America (Sao Paulo)
- China (Beijing)
- China (Ningxia)
- Canada (Central)
- Canada West (Calgary)
- Middle East (UAE)
- Middle East (Bahrain)
- Africa (Cape Town)
- US ISO East
- US ISOB East (Ohio)
- US ISO West
- Israel (Tel Aviv)
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Requester Pays
If true, indicates that the requester consents to pay any charges associated with listing the S3 bucket. This sets the 'x-amz-request-payer' header to 'requester'. Note that this setting is not applicable when 'Use Versions' is 'true'.
Display Name

Requester Pays

Description

If true, indicates that the requester consents to pay any charges associated with listing the S3 bucket. This sets the 'x-amz-request-payer' header to 'requester'. Note that this setting is not applicable when 'Use Versions' is 'true'.

API Name

requester-pays

Default Value

false

Allowable Values
- True
- False
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Signer Override
The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.
Display Name

Signer Override

Description

The AWS S3 library uses Signature Version 4 by default but this property allows you to specify the Version 2 signer to support older S3-compatible services or even to plug in your own custom signer implementation.

API Name

Signer Override

Default Value

Default Signature

Allowable Values
- Default Signature
- Signature Version 4
- Signature Version 2
- Custom Signature
Expression Language Scope

Not Supported

Sensitive

false

Required

false
SSL Context Service
Specifies an optional SSL Context Service that, if provided, will be used to create connections

Display Name

SSL Context Service

Description

Specifies an optional SSL Context Service that, if provided, will be used to create connections

API Name

SSL Context Service

Service Interface

org.apache.nifi.ssl.SSLContextProvider

Service Implementations

org.apache.nifi.ssl.PEMEncodedSSLContextProvider

org.apache.nifi.ssl.StandardRestrictedSSLContextService

org.apache.nifi.ssl.StandardSSLContextService

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Use Versions
Specifies whether to use S3 versions, if applicable. If false, only the latest version of each object will be returned.
Display Name

Use Versions

Description

Specifies whether to use S3 versions, if applicable. If false, only the latest version of each object will be returned.

API Name

use-versions

Default Value

false

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Write Object Tags
If set to 'True', the tags associated with the S3 object will be written as FlowFile attributes
Display Name

Write Object Tags

Description

If set to 'True', the tags associated with the S3 object will be written as FlowFile attributes

API Name

write-s3-object-tags

Default Value

false

Allowable Values
- True
- False
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Write User Metadata
If set to 'True', the user defined metadata associated with the S3 object will be added to FlowFile attributes/records
Display Name

Write User Metadata

Description

If set to 'True', the user defined metadata associated with the S3 object will be added to FlowFile attributes/records

API Name

write-s3-user-metadata

Default Value

false

Allowable Values
- True
- False
Expression Language Scope

Not Supported

Sensitive

false

Required

true

State Management

Scopes	Description
CLUSTER	After performing a listing of keys, the timestamp of the newest key is stored, along with the keys that share that same timestamp. This allows the Processor to list only keys that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
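
The state described above (the newest timestamp plus the keys that share it) can be modeled with a short sketch. It is a simplified illustration under stated assumptions, not the processor's implementation: `ListingState` and `list_new_objects` are hypothetical names, and objects are represented as `(key, last_modified_ms)` pairs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class ListingState:
    newest_timestamp_ms: int = -1
    keys_at_newest: FrozenSet[str] = frozenset()


def list_new_objects(
    objects: Iterable[Tuple[str, int]], state: ListingState
) -> Tuple[List[Tuple[str, int]], ListingState]:
    """Return objects not yet listed, plus the updated state to store."""
    fresh = []
    for key, ts in objects:
        if ts > state.newest_timestamp_ms:
            fresh.append((key, ts))
        elif ts == state.newest_timestamp_ms and key not in state.keys_at_newest:
            # Same timestamp as the stored one, but a key not seen before.
            fresh.append((key, ts))
    if not fresh:
        return fresh, state
    newest = max(ts for _, ts in fresh)
    keys = {k for k, ts in fresh if ts == newest}
    if newest == state.newest_timestamp_ms:
        keys |= state.keys_at_newest
    return fresh, ListingState(newest, frozenset(keys))
```

Storing the keys alongside the timestamp is what lets a later run pick up an object that arrives with the same timestamp as the newest one already listed, without re-emitting the earlier ones.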


Relationships

Name	Description
success	FlowFiles are routed to this Relationship after they have been successfully processed.

Writes Attributes

Name	Description
s3.bucket	The name of the S3 bucket
s3.region	The region of the S3 bucket
filename	The name of the file
s3.etag	The ETag that can be used to see if the file has changed
s3.isLatest	A boolean indicating if this is the latest version of the object
s3.lastModified	The last modified time in milliseconds since epoch in UTC time
s3.length	The size of the object in bytes
s3.storeClass	The storage class of the object
s3.version	The version of the object, if applicable
s3.tag.___	If 'Write Object Tags' is set to 'True', the tags associated with the S3 object being listed are written as FlowFile attributes
s3.user.metadata.___	If 'Write User Metadata' is set to 'True', the user-defined metadata associated with the S3 object being listed is written as FlowFile attributes
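
As a hedged illustration of how the attribute names in the table relate to a listed object, the sketch below builds an attribute dictionary from a record shaped like the Avro schema earlier on this page. `to_attributes` is a hypothetical function. Using the object key as `filename` and rendering every value as a string are assumptions, and the `___` placeholder is read here as the tag or metadata name.

```python
from typing import Any, Dict


def to_attributes(record: Dict[str, Any], region: str) -> Dict[str, str]:
    """Map a listing record to FlowFile-style string attributes (illustrative only)."""
    attrs = {
        "s3.bucket": record["bucket"],
        "s3.region": region,
        "filename": record["key"],  # assumption: filename carries the object key
        "s3.etag": record["etag"],
        "s3.isLatest": str(record["latest"]).lower(),
        "s3.lastModified": str(record["lastModified"]),
        "s3.length": str(record["size"]),
        "s3.storeClass": record["storageClass"],
    }
    if record.get("versionId") is not None:
        attrs["s3.version"] = record["versionId"]
    # One attribute per tag and per user metadata entry, when present.
    for name, value in (record.get("tags") or {}).items():
        attrs[f"s3.tag.{name}"] = value
    for name, value in (record.get("userMetadata") or {}).items():
        attrs[f"s3.user.metadata.{name}"] = value
    return attrs
```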
