ListS3

Description:

Retrieves a listing of objects from an S3 bucket. For each object that is listed, creates a FlowFile that represents the object so that it can be fetched in conjunction with FetchS3Object. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data.

Tags:

Amazon, S3, AWS, list

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, whether a property supports the NiFi Expression Language, and whether a property is considered "sensitive", meaning that its value will be encrypted. Before entering a value in a sensitive property, ensure that the nifi.properties file has an entry for the property nifi.sensitive.props.key.

Name | Default Value | Allowable Values | Description
Listing Strategy
Default Value: Tracking Timestamps
Allowable Values:
  • Tracking Timestamps: This strategy tracks the latest timestamp of listed entities to determine new/updated entities. Since it tracks only a few timestamps, it can manage listing state efficiently. This strategy will not pick up a newly added or modified entity if its timestamp is older than the tracked latest timestamp. It may also miss files when multiple subdirectories are being written at the same time while a listing is running.
  • Tracking Entities: This strategy tracks information about all the listed entities within the latest 'Entity Tracking Time Window' to determine new/updated entities. It can pick up entities with an old timestamp that would be missed with 'Tracking Timestamps'. It works even when multiple subdirectories are being written at the same time while a listing is running. However, an additional DistributedMapCache controller service is required and more JVM heap memory is used. For more information on how the 'Entity Tracking Time Window' property works, see that property's description.
Specify how to determine new/updated entities. See each strategy's description for details.
Entity Tracking State Cache
Controller Service API:
DistributedMapCacheClient
Implementations: HBase_1_1_2_ClientMapCacheService
RedisDistributedMapCacheClientService
CouchbaseMapCacheClient
CassandraDistributedMapCache
HazelcastMapCacheClient
DistributedMapCacheClientService
HBase_2_ClientMapCacheService
Listed entities are stored in the specified cache storage so that this processor can resume listing across NiFi restarts or after a primary node change. The 'Tracking Entities' strategy requires tracking information for all listed entities within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added to manage state separately. E.g. cluster-wide cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b', per-node cache key = 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a Gzipped JSON string. The cache key will be deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

This Property is only considered if the [Listing Strategy] Property has a value of "Tracking Entities".
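
The cache key format and the Gzipped JSON content described for Entity Tracking State Cache can be illustrated with a minimal Python sketch. The function names and the JSON field names below are hypothetical and are not NiFi's internal schema; only the key format and the Gzip-plus-JSON encoding come from the description above.

```python
import gzip
import json
from typing import Optional


def listed_entities_cache_key(processor_id: str, node_id: Optional[str] = None) -> str:
    """Build a key in the documented 'ListedEntities::{processorId}(::{nodeId})' format."""
    key = f"ListedEntities::{processor_id}"
    if node_id is not None:
        # The '::{nodeId}' suffix is present only when entities are tracked per node.
        key += f"::{node_id}"
    return key


def encode_entities(entities: dict) -> bytes:
    """Serialize tracked entities as a Gzipped JSON string (field names are assumptions)."""
    return gzip.compress(json.dumps(entities).encode("utf-8"))


def decode_entities(payload: bytes) -> dict:
    """Reverse of encode_entities: gunzip, then parse the JSON."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

Calling listed_entities_cache_key with the processor ID from the example above and node_id="nifi-node3" yields the per-node key shown in the property description.
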
Entity Tracking Initial Listing Target
Default Value: All Available
Allowable Values:
  • Tracking Time Window: Ignore entities with a timestamp older than the specified 'Tracking Time Window' during the initial listing.
  • All Available: Regardless of entity timestamps, all existing entities will be listed during the initial listing.
Specify how the initial listing should be handled. Used by the 'Tracking Entities' strategy.

This Property is only considered if the [Listing Strategy] Property has a value of "Tracking Entities".
Entity Tracking Time Window
Default Value: 3 hours
Specify how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp in the most recent 30 minutes will be a listing target when this processor runs. A listed entity is considered 'new/updated' and a FlowFile is emitted if any one of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity's timestamp becomes older than the specified time window, that entity will be removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.
Supports Expression Language: true (will be evaluated using variable registry only)

This Property is only considered if the [Entity Tracking Initial Listing Target] Property has a value of "Tracking Time Window".
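
The three 'new/updated' conditions and the eviction rule described for Entity Tracking Time Window can be sketched as follows. This is an illustrative model only, not the processor's implementation: the Entity class, the track_entities function, and the in-memory dict standing in for the DistributedMapCache are all hypothetical, and the 'Entity Tracking Initial Listing Target' option is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    key: str
    timestamp_ms: int
    size: int


def track_entities(cache: Dict[str, Entity], listed: List[Entity],
                   now_ms: int, window_ms: int) -> List[Entity]:
    """Return the entities considered new/updated and update the cache in place."""
    window_start = now_ms - window_ms
    emitted = []
    for entity in listed:
        if entity.timestamp_ms < window_start:
            continue  # outside the tracking window, so not a listing target
        cached = cache.get(entity.key)
        if (
            cached is None                                 # 1. not already listed
            or entity.timestamp_ms > cached.timestamp_ms   # 2. newer timestamp
            or entity.size != cached.size                  # 3. different size
        ):
            emitted.append(entity)
            cache[entity.key] = entity
    # Remove cached entities whose timestamp has aged out of the window.
    for key in [k for k, e in cache.items() if e.timestamp_ms < window_start]:
        del cache[key]
    return emitted
```

Because every cached entity within the window is kept, memory use grows with the number of objects modified inside the window, which is the trade-off the strategy description mentions.
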
Bucket
No Description Provided.
Supports Expression Language: true (will be evaluated using variable registry only)
Region
Default Value: US West (Oregon)
Allowable Values:
  • AWS GovCloud (US) AWS Region Code : us-gov-west-1
  • AWS GovCloud (US-East) AWS Region Code : us-gov-east-1
  • US East (N. Virginia) AWS Region Code : us-east-1
  • US East (Ohio) AWS Region Code : us-east-2
  • US West (N. California) AWS Region Code : us-west-1
  • US West (Oregon) AWS Region Code : us-west-2
  • EU (Ireland) AWS Region Code : eu-west-1
  • EU (London) AWS Region Code : eu-west-2
  • EU (Paris) AWS Region Code : eu-west-3
  • EU (Frankfurt) AWS Region Code : eu-central-1
  • EU (Stockholm) AWS Region Code : eu-north-1
  • EU (Milan) AWS Region Code : eu-south-1
  • Asia Pacific (Hong Kong) AWS Region Code : ap-east-1
  • Asia Pacific (Mumbai) AWS Region Code : ap-south-1
  • Asia Pacific (Singapore) AWS Region Code : ap-southeast-1
  • Asia Pacific (Sydney) AWS Region Code : ap-southeast-2
  • Asia Pacific (Jakarta) AWS Region Code : ap-southeast-3
  • Asia Pacific (Tokyo) AWS Region Code : ap-northeast-1
  • Asia Pacific (Seoul) AWS Region Code : ap-northeast-2
  • Asia Pacific (Osaka) AWS Region Code : ap-northeast-3
  • South America (Sao Paulo) AWS Region Code : sa-east-1
  • China (Beijing) AWS Region Code : cn-north-1
  • China (Ningxia) AWS Region Code : cn-northwest-1
  • Canada (Central) AWS Region Code : ca-central-1
  • Middle East (UAE) AWS Region Code : me-central-1
  • Middle East (Bahrain) AWS Region Code : me-south-1
  • Africa (Cape Town) AWS Region Code : af-south-1
  • US ISO East AWS Region Code : us-iso-east-1
  • US ISOB East (Ohio) AWS Region Code : us-isob-east-1
  • US ISO West AWS Region Code : us-iso-west-1
No Description Provided.
Access Key ID
No Description Provided.
Sensitive Property: true
Supports Expression Language: true (will be evaluated using variable registry only)
Secret Access Key
No Description Provided.
Sensitive Property: true
Supports Expression Language: true (will be evaluated using variable registry only)
Record Writer
Controller Service API:
RecordSetWriterFactory
Implementations: RecordSetWriterLookup
AvroRecordSetWriter
FreeFormTextRecordSetWriter
XMLRecordSetWriter
ScriptedRecordSetWriter
CSVRecordSetWriter
JsonRecordSetWriter
ParquetRecordSetWriter
Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.
Minimum Object Age
Default Value: 0 sec
The minimum age that an S3 object must be in order to be considered; any object younger than this amount of time (according to last modification date) will be ignored.
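
As a small illustration of the Minimum Object Age rule, the sketch below keeps only objects whose last modification time is at least the minimum age in the past. The function name and the (key, last-modified) tuple layout are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def filter_by_minimum_age(objects: Iterable[Tuple[str, int]],
                          now_ms: int, min_age_ms: int) -> List[str]:
    """Keep keys whose last-modified time is at least min_age_ms before now_ms."""
    return [
        key
        for key, last_modified_ms in objects
        # Objects younger than the minimum age are ignored.
        if now_ms - last_modified_ms >= min_age_ms
    ]
```
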
Listing Batch Size
Default Value: 100
If not using a Record Writer, this property dictates how many S3 objects should be listed in a single batch. Once this number is reached, the FlowFiles that have been created will be transferred out of the Processor. Setting this value lower may result in lower latency by sending out the FlowFiles before the complete listing has finished. However, it can significantly reduce performance. Larger values may take more memory to store all of the information before sending the FlowFiles out. This property is ignored if using a Record Writer, as one of the main benefits of the Record Writer is being able to emit the entire listing as a single FlowFile.
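
The batching behavior can be pictured with a short generator that hands out listed keys in groups of at most the batch size. This is only a sketch of the idea; in_batches is a hypothetical helper, not part of NiFi.

```python
from typing import Iterable, Iterator, List


def in_batches(keys: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield listed keys in groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == batch_size:
            # A full batch is handed off before the listing has finished.
            yield batch
            batch = []
    if batch:
        # The final batch may be smaller than batch_size.
        yield batch
```

A smaller batch size hands results downstream sooner but does so more often, while a larger one holds more entries in memory before each hand-off.
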
Write Object Tags
Default Value: False
Allowable Values:
  • True
  • False
If set to 'True', the tags associated with the S3 object will be written as FlowFile attributes.
Write User Metadata
Default Value: False
Allowable Values:
  • True
  • False
If set to 'True', the user-defined metadata associated with the S3 object will be added to FlowFile attributes/records.
Credentials File
Path to a file containing AWS access key and secret key in properties file format.

This property requires exactly one file to be provided.
AWS Credentials Provider Service
Controller Service API:
AWSCredentialsProviderService
Implementation: AWSCredentialsProviderControllerService
The Controller Service that is used to obtain the AWS credentials provider.
Communications Timeout
Default Value: 30 secs
No Description Provided.
SSL Context Service
Controller Service API:
SSLContextService
Implementations: StandardSSLContextService
StandardRestrictedSSLContextService
Specifies an optional SSL Context Service that, if provided, will be used to create connections.
Endpoint Override URL
Endpoint URL to use instead of the AWS default, including scheme, host, port, and path. The AWS libraries select an endpoint URL based on the AWS region, but this property overrides the selected endpoint URL, allowing use with other S3-compatible endpoints.
Supports Expression Language: true (will be evaluated using variable registry only)
Signer Override
Default Value: Default Signature
Allowable Values:
  • Default Signature
  • Signature v4
  • Signature v2
The AWS libraries use the default signer, but this property allows you to specify a custom signer to support older S3-compatible services.
Proxy Configuration Service
Controller Service API:
ProxyConfigurationService
Implementation: StandardProxyConfigurationService
Specifies the Proxy Configuration Controller Service to proxy network requests. If set, it supersedes proxy settings configured per component. Supported proxies: HTTP + AuthN
Proxy Host
Proxy host name or IP
Supports Expression Language: true (will be evaluated using variable registry only)
Proxy Host Port
Proxy host port
Supports Expression Language: true (will be evaluated using variable registry only)
Proxy Username
Proxy username
Supports Expression Language: true (undefined scope)
Proxy Password
Proxy password
Sensitive Property: true
Supports Expression Language: true (undefined scope)
Delimiter
The string used to delimit directories within the bucket. Please consult the AWS documentation for the correct use of this field.
Prefix
The prefix used to filter the object list. In most cases, it should end with a forward slash ('/').
Supports Expression Language: true (will be evaluated using variable registry only)
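
To show how Prefix and Delimiter interact, the sketch below models the general S3 listing semantics on an in-memory list of keys: keys are filtered by prefix, and any key that contains the delimiter after the prefix is rolled up into a common prefix instead of being returned as an object. It is a simplified model written for this page, not ListS3's code, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def list_with_prefix_and_delimiter(keys: Iterable[str], prefix: str = "",
                                   delimiter: str = "") -> Tuple[List[str], List[str]]:
    """Return (object keys, common prefixes) for a simulated S3 listing."""
    contents: List[str] = []
    common_prefixes: List[str] = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue  # filtered out by Prefix
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            contents.append(key)  # no delimiter after the prefix: a direct match
        else:
            # Roll the key up to the first delimiter after the prefix.
            rolled = prefix + rest[:cut + len(delimiter)]
            if rolled not in common_prefixes:
                common_prefixes.append(rolled)
    return contents, common_prefixes
```

With the keys logs/2024/a.log, logs/readme.txt and logs-old/x.log, the prefix 'logs' without a trailing slash also matches logs-old/x.log, which is why the prefix should usually end with '/'.
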
Use Versions
Default Value: false
Allowable Values:
  • true
  • false
Specifies whether to use S3 versions, if applicable. If false, only the latest version of each object will be returned.
List Type
Default Value: List Objects V1
Allowable Values:
  • List Objects V1
  • List Objects V2
Specifies whether to use the original List Objects or the newer List Objects Version 2 endpoint.
Requester Pays
Default Value: False
Allowable Values:
  • True: Indicates that the requester consents to pay any charges associated with listing the S3 bucket.
  • False: Does not consent to pay requester charges for listing the S3 bucket.
If true, indicates that the requester consents to pay any charges associated with listing the S3 bucket. This sets the 'x-amz-request-payer' header to 'requester'. Note that this setting is not applicable when 'Use Versions' is 'true'.

Relationships:

Name | Description
success | FlowFiles are routed to success relationship

Reads Attributes:

None specified.

Writes Attributes:

Name | Description
s3.bucket | The name of the S3 bucket
filename | The name of the file
s3.etag | The ETag that can be used to see if the file has changed
s3.isLatest | A boolean indicating if this is the latest version of the object
s3.lastModified | The last modified time in milliseconds since epoch in UTC time
s3.length | The size of the object in bytes
s3.storeClass | The storage class of the object
s3.version | The version of the object, if applicable
s3.tag.___ | If 'Write Object Tags' is set to 'True', the tags associated with the S3 object that is being listed will be written as part of the FlowFile attributes
s3.user.metadata.___ | If 'Write User Metadata' is set to 'True', the user-defined metadata associated with the S3 object that is being listed will be written as part of the FlowFile attributes
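
As a rough picture of the attribute naming listed above, the following sketch assembles an attribute map for one listed object. The build_attributes function and its parameters are hypothetical. Using the object key as the filename and rendering values as lowercase or decimal strings are assumptions for illustration, not confirmed details of the processor.

```python
from typing import Dict, Optional


def build_attributes(bucket: str, key: str, etag: str, is_latest: bool,
                     last_modified_ms: int, length: int, storage_class: str,
                     version: Optional[str] = None,
                     tags: Optional[Dict[str, str]] = None,
                     user_metadata: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Assemble FlowFile-style attributes for one listed S3 object."""
    attrs = {
        "s3.bucket": bucket,
        "filename": key,  # assumption: the object key is used as the file name
        "s3.etag": etag,
        "s3.isLatest": str(is_latest).lower(),
        "s3.lastModified": str(last_modified_ms),  # milliseconds since epoch, UTC
        "s3.length": str(length),
        "s3.storeClass": storage_class,
    }
    if version is not None:
        attrs["s3.version"] = version
    # One attribute per tag, written only when 'Write Object Tags' is 'True'.
    for name, value in (tags or {}).items():
        attrs[f"s3.tag.{name}"] = value
    # One attribute per metadata entry, only when 'Write User Metadata' is 'True'.
    for name, value in (user_metadata or {}).items():
        attrs[f"s3.user.metadata.{name}"] = value
    return attrs
```
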

State management:

Scope | Description
CLUSTER | After performing a listing of keys, the timestamp of the newest key is stored, along with the keys that share that same timestamp. This allows the Processor to list only keys that have been added or modified after this date the next time that the Processor is run. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
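
The state described above (the newest timestamp plus the keys that share it) can be sketched in a few lines. This is a simplified model under assumed names (list_new_keys, state_ts, state_keys), not the processor's implementation; it also shows why an object that appears later with an older timestamp is skipped.

```python
from typing import List, Set, Tuple


def list_new_keys(objects: List[Tuple[str, int]], state_ts: int,
                  state_keys: Set[str]) -> Tuple[List[str], int, Set[str]]:
    """Return (new keys, newest timestamp, keys sharing that newest timestamp)."""
    new_keys: List[str] = []
    newest_ts = state_ts
    newest_keys = set(state_keys)
    # Process in timestamp order so the stored state only moves forward.
    for key, ts in sorted(objects, key=lambda o: (o[1], o[0])):
        if ts < state_ts:
            continue  # older than the stored timestamp: treated as already listed
        if ts == state_ts and key in state_keys:
            continue  # same timestamp and already recorded: avoid a duplicate
        new_keys.append(key)
        if ts > newest_ts:
            newest_ts, newest_keys = ts, {key}
        else:
            newest_keys.add(key)
    return new_keys, newest_ts, newest_keys
```

Storing the keys that share the newest timestamp lets a later run emit another object with that exact timestamp without re-emitting the ones already listed.
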

Restricted:

This component is not restricted.

Input requirement:

This component does not allow an incoming relationship.

System Resource Considerations:

None specified.

See Also:

FetchS3Object, PutS3Object, DeleteS3Object