ListHDFS

Description:

Retrieves a listing of files from HDFS. Each time a listing is performed, the files with the latest timestamp will be excluded and picked up during the next execution of the processor. This is done to ensure that we do not miss any files, or produce duplicates, in the cases where files with the same timestamp are written immediately before and after a single execution of the processor. For each file that is listed in HDFS, this processor creates a FlowFile that represents the HDFS file to be fetched in conjunction with FetchHDFS. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.

Tags:

hadoop, HDFS, get, list, ingest, source, filesystem

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Name | Default Value | Allowable Values | Description
Hadoop Configuration Resources | | | A file or comma-separated list of files that contain the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see the 'Additional Details' section of PutHDFS's documentation. Supports Expression Language: true
Kerberos Principal | | | Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. Supports Expression Language: true
Kerberos Keytab | | | Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. Supports Expression Language: true
Kerberos Relogin Period | 4 hours | | Period of time which should pass before attempting a Kerberos relogin. This property has been deprecated and has no effect on processing. Relogins now occur automatically. Supports Expression Language: true
Additional Classpath Resources | | | A comma-separated list of paths to files and/or directories that will be added to the classpath. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included.
Distributed Cache Service | | Controller Service API: DistributedMapCacheClient. Implementations: HBase_1_1_2_ClientMapCacheService, DistributedMapCacheClientService, RedisDistributedMapCacheClientService | Specifies the Controller Service that should be used to maintain state about what has been pulled from HDFS so that if a new node begins pulling data, it won't duplicate all of the work that has been done.
Directory | | | The HDFS directory from which files should be read. Supports Expression Language: true
Recurse Subdirectories | true | true, false | Indicates whether to list files from subdirectories of the HDFS directory
File Filter | [^\.].* | | Only files whose names match the given regular expression will be picked up
Minimum File Age | | | The minimum age that a file must be in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored
Maximum File Age | | | The maximum age that a file may be in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored. Minimum value is 100ms.
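
As a rough illustration of how the File Filter, Minimum File Age, and Maximum File Age properties narrow a listing, the following Python sketch applies all three checks to a single file. It is not the processor's implementation: the function name is_listable and its parameters are hypothetical, and both the full-match regex semantics and the inclusive age bounds are assumptions based on the property descriptions.

```python
import re

# Default value of the File Filter property: any name not starting with a dot.
DEFAULT_FILE_FILTER = r"[^\.].*"


def is_listable(name, mtime_ms, now_ms,
                file_filter=DEFAULT_FILE_FILTER,
                min_age_ms=0, max_age_ms=None):
    """Return True if a file passes the name filter and both age limits.

    name       -- file name without its directory
    mtime_ms   -- last modification time, in milliseconds since the epoch
    now_ms     -- current time, in milliseconds since the epoch
    max_age_ms -- None means no upper limit (hypothetical convention)
    """
    # Assumption: the whole file name must match the expression.
    if re.fullmatch(file_filter, name) is None:
        return False
    age_ms = now_ms - mtime_ms
    if age_ms < min_age_ms:
        return False  # too young, may still be being written
    if max_age_ms is not None and age_ms > max_age_ms:
        return False  # too old
    return True
```

With the default filter, a name such as ".staging" is rejected because it begins with a dot, while "data.csv" is accepted as long as its age falls within the configured limits.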

Relationships:

Name | Description
success | All FlowFiles are transferred to this relationship

Reads Attributes:

None specified.

Writes Attributes:

Name | Description
filename | The name of the file that was read from HDFS.
path | The path is set to the absolute path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "/tmp". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "/tmp/abc/1/2/3".
hdfs.owner | The user that owns the file in HDFS
hdfs.group | The group that owns the file in HDFS
hdfs.lastModified | The timestamp of when the file in HDFS was last modified, as milliseconds since midnight Jan 1, 1970 UTC
hdfs.length | The number of bytes in the file in HDFS
hdfs.replication | The number of HDFS replicas for the file
hdfs.permissions | The permissions for the file in HDFS. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example: rw-rw-r--
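
For downstream code that consumes these attributes, the sketch below shows one way to interpret hdfs.lastModified and hdfs.permissions in Python. The helper names parse_last_modified and permissions_to_octal are hypothetical, and the permission parser is a simplification that treats any character other than "-" as a set bit.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_last_modified(value):
    """Convert an hdfs.lastModified value (milliseconds since the epoch) to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=int(value))


def permissions_to_octal(perms):
    """Convert a 9-character string such as 'rw-rw-r--' to a numeric mode."""
    if len(perms) != 9:
        raise ValueError("expected 9 permission characters, got %r" % perms)
    mode = 0
    for ch in perms:
        # Shift in one bit per character: 1 if the permission is granted.
        mode = (mode << 1) | int(ch != "-")
    return mode
```

For the example value in the table, permissions_to_octal("rw-rw-r--") yields the mode 0o664.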

State management:

Scope | Description
CLUSTER | After performing a listing of HDFS files, the latest timestamp of all the files listed and the latest timestamp of all the files transferred are both stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run, without having to store all of the actual filenames/paths which could lead to performance problems. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
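
The two stored timestamps, combined with the rule from the Description that files carrying the latest timestamp are held back for one execution, can be sketched as follows. This is a simplified Python model under stated assumptions, not the processor's actual code: ListingState and perform_listing are hypothetical names, files are represented as (path, modification time in milliseconds) pairs, and persisting the state to the cluster is left out.

```python
from typing import List, NamedTuple, Tuple


class ListingState(NamedTuple):
    latest_listed_ms: int = -1   # newest modification time seen in the previous listing
    latest_emitted_ms: int = -1  # newest modification time already transferred


def perform_listing(files: List[Tuple[str, int]],
                    state: ListingState) -> Tuple[List[Tuple[str, int]], ListingState]:
    """Return (files to emit, updated state) for one execution."""
    # Only files newer than anything already transferred are candidates.
    candidates = [f for f in files if f[1] > state.latest_emitted_ms]
    if not candidates:
        return [], state

    newest = max(mtime for _, mtime in candidates)
    if newest == state.latest_listed_ms:
        # These files were seen and held back in the previous execution
        # and nothing newer has arrived, so release them now.
        to_emit = candidates
    else:
        # Hold back files with the newest timestamp for one execution, in
        # case more files with that same timestamp are still being written.
        to_emit = [f for f in candidates if f[1] < newest]

    emitted = max((mtime for _, mtime in to_emit), default=state.latest_emitted_ms)
    return to_emit, ListingState(newest, emitted)
```

Because only two numbers are carried between executions, a newly elected Primary Node that loads the same state can continue the listing without re-emitting files and without the processor tracking individual file names.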

Restricted:

This component is not restricted.

Input requirement:

This component does not allow an incoming relationship.

See Also:

GetHDFS, FetchHDFS, PutHDFS