ListHDFS

Description:

Retrieves a listing of files from HDFS. For each file that is listed in HDFS, this processor creates a FlowFile that represents the HDFS file to be fetched in conjunction with FetchHDFS. This Processor is designed to run on Primary Node only in a cluster. If the primary node changes, the new Primary Node will pick up where the previous node left off without duplicating all of the data. Unlike GetHDFS, this Processor does not delete any data from HDFS.

Tags:

hadoop, HCFS, HDFS, get, list, ingest, source, filesystem

Properties:

In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.

Display Name	API Name	Default Value	Allowable Values	Description
Hadoop Configuration Resources	Hadoop Configuration Resources			A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for a 'core-site.xml' and 'hdfs-site.xml' file or will revert to a default configuration. To use swebhdfs, see 'Additional Details' section of PutHDFS's documentation. This property expects a comma-separated list of file resources. Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Credentials Service	kerberos-credentials-service		Controller Service API: KerberosCredentialsService Implementation: KeytabCredentialsService	Specifies the Kerberos Credentials Controller Service that should be used for authenticating with Kerberos
Kerberos User Service	kerberos-user-service		Controller Service API: KerberosUserService Implementations: KerberosPasswordUserService KerberosKeytabUserService KerberosTicketCacheUserService	Specifies the Kerberos User Controller Service that should be used for authenticating with Kerberos
Kerberos Principal	Kerberos Principal			Kerberos principal to authenticate as. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Keytab	Kerberos Keytab			Kerberos keytab associated with the principal. Requires nifi.kerberos.krb5.file to be set in your nifi.properties. This property requires exactly one file to be provided. Supports Expression Language: true (will be evaluated using variable registry only)
Kerberos Password	Kerberos Password			Kerberos password associated with the principal. Sensitive Property: true
Kerberos Relogin Period	Kerberos Relogin Period	4 hours		Period of time which should pass before attempting a Kerberos relogin. This property has been deprecated, and has no effect on processing. Relogins now occur automatically. Supports Expression Language: true (will be evaluated using variable registry only)
Additional Classpath Resources	Additional Classpath Resources			A comma-separated list of paths to files and/or directories that will be added to the classpath and used for loading native libraries. When specifying a directory, all files within the directory will be added to the classpath, but further sub-directories will not be included. This property expects a comma-separated list of resources. Each of the resources may be of any of the following types: file, directory.
Directory	Directory			The HDFS directory from which files should be read. Supports Expression Language: true (will be evaluated using variable registry only)
Recurse Subdirectories	Recurse Subdirectories	true	true false	Indicates whether to list files from subdirectories of the HDFS directory
Record Writer	record-writer		Controller Service API: RecordSetWriterFactory Implementations: JsonRecordSetWriter RecordSetWriterLookup AvroRecordSetWriter XMLRecordSetWriter FreeFormTextRecordSetWriter CSVRecordSetWriter ParquetRecordSetWriter ScriptedRecordSetWriter	Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile.
File Filter	File Filter	[^\.].*		Only files whose names match the given regular expression will be picked up
File Filter Mode	file-filter-mode	Directories and Files	Directories and Files Files Only Full Path	Determines how the regular expression in File Filter will be used when retrieving listings.
Minimum File Age	minimum-file-age			The minimum age that a file must be in order to be pulled; any file younger than this amount of time (based on last modification date) will be ignored
Maximum File Age	maximum-file-age			The maximum age that a file may be in order to be pulled; any file older than this amount of time (based on last modification date) will be ignored. Minimum value is 100ms.
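
As a rough illustration of how the default File Filter and the two age properties might combine, the sketch below uses Python's re module. The helper name is_listed, its parameters, and the assumption that the regular expression must match the whole file name are hypothetical and chosen for illustration only; the processor's actual matching behaviour also depends on File Filter Mode.

```python
import re

# Default value of the File Filter property: names that do not start with a dot.
DEFAULT_FILE_FILTER = r"[^\.].*"


def is_listed(name, last_modified_ms, now_ms, pattern=DEFAULT_FILE_FILTER,
              min_age_ms=None, max_age_ms=None):
    """Hypothetical helper: would a file pass the name and age filters?"""
    # Assumption: the pattern must match the entire file name.
    if re.fullmatch(pattern, name) is None:
        return False
    age_ms = now_ms - last_modified_ms
    if min_age_ms is not None and age_ms < min_age_ms:
        return False  # younger than Minimum File Age
    if max_age_ms is not None and age_ms > max_age_ms:
        return False  # older than Maximum File Age
    return True
```

With the default filter, a name such as data.csv passes while a hidden file such as .hidden does not.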

Relationships:

Name	Description
success	All FlowFiles are transferred to this relationship

Reads Attributes:

None specified.

Writes Attributes:

Name	Description
filename	The name of the file that was read from HDFS.
path	The path is set to the absolute path of the file's directory on HDFS. For example, if the Directory property is set to /tmp, then files picked up from /tmp will have the path attribute set to "/tmp". If the Recurse Subdirectories property is set to true and a file is picked up from /tmp/abc/1/2/3, then the path attribute will be set to "/tmp/abc/1/2/3".
hdfs.owner	The user that owns the file in HDFS
hdfs.group	The group that owns the file in HDFS
hdfs.lastModified	The timestamp of when the file in HDFS was last modified, as milliseconds since midnight Jan 1, 1970 UTC
hdfs.length	The number of bytes in the file in HDFS
hdfs.replication	The number of HDFS replicas for the file
hdfs.permissions	The permissions for the file in HDFS. This is formatted as 3 characters for the owner, 3 for the group, and 3 for other users. For example rw-rw-r--
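
A downstream consumer may want to interpret the hdfs.permissions and hdfs.lastModified attributes described above. The following is a minimal sketch in Python; the function names parse_permissions and last_modified_utc are hypothetical and are not part of NiFi.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_permissions(perms):
    """Split a 9-character string such as 'rw-rw-r--' into its three parts."""
    if len(perms) != 9:
        raise ValueError("expected 9 permission characters")
    return {"owner": perms[0:3], "group": perms[3:6], "other": perms[6:9]}


def last_modified_utc(millis):
    """Convert hdfs.lastModified (milliseconds since the epoch) to a UTC datetime."""
    # Attribute values arrive as strings, so accept both str and int.
    return EPOCH + timedelta(milliseconds=int(millis))
```

For the example value rw-rw-r--, the owner and group parts are both rw- and the part for other users is r--.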

State management:

Scope	Description
CLUSTER	After performing a listing of HDFS files, the latest timestamp of all the files listed is stored. This allows the Processor to list only files that have been added or modified after this date the next time that the Processor is run, without having to store all of the actual filenames/paths which could lead to performance problems. State is stored across the cluster so that this Processor can be run on Primary Node only and if a new Primary Node is selected, the new node can pick up where the previous node left off, without duplicating the data.
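
The timestamp-based approach described above can be sketched as follows. This is a simplified illustration, not the processor's implementation: the function list_new_files and the representation of a listing as (path, modified_ms) pairs are assumptions, and the sketch ignores edge cases such as several files sharing the latest timestamp.

```python
def list_new_files(files, last_timestamp):
    """Return the files modified after last_timestamp and the new state value.

    files is an iterable of (path, modified_ms) pairs (a hypothetical shape).
    Only one number is kept as state, rather than every path already seen.
    """
    new_files = [(path, ts) for path, ts in files if ts > last_timestamp]
    # If nothing new was found, the stored timestamp stays unchanged.
    latest = max((ts for _, ts in new_files), default=last_timestamp)
    return new_files, latest
```

Because the only state is a single timestamp, a newly elected Primary Node that reads the same stored value would continue from the same point.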

Restricted:

This component is not restricted.

Input requirement:

This component does not allow an incoming relationship.

System Resource Considerations: