HadoopFileIO to support bulk delete through the Hadoop Filesystem APIs #12055

steveloughran opened this issue Jan 22, 2025 · 0 comments

Feature Request / Improvement

Hadoop filesystems now support a paged bulk delete API.

For most filesystems the page size is 1; a bulk delete simply maps to a single file delete.

For S3A, the page size is the value of fs.s3a.bulk.delete.page.size; each page of
deletions is executed as a single bulk delete POST in the AWS API.

No attempt is made to implement the POSIX "safety checks" of classic delete calls,
such as probing for the path being a directory or for the parent directory
existing afterwards.

As such it is the most efficient way to delete many objects; its performance
should match that of S3FileIO.deleteFiles().
For filesystems without a bulk delete implementation, each deletion is mapped to
delete(path), so it is no less efficient than normal delete calls.
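
For reference, a minimal sketch of using the Hadoop 3.4.1 API directly
(org.apache.hadoop.fs.BulkDelete, obtained through FileSystem.createBulkDelete());
bucket and file names are purely illustrative:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkDeleteExample {
  public static void main(String[] args) throws IOException {
    // All paths handed to a BulkDelete instance must live under its base path.
    Path base = new Path("s3a://example-bucket/warehouse/db/table");
    FileSystem fs = base.getFileSystem(new Configuration());

    try (BulkDelete bulkDelete = fs.createBulkDelete(base)) {
      // 1 for most filesystems; fs.s3a.bulk.delete.page.size for S3A.
      int pageSize = bulkDelete.pageSize();
      System.out.println("bulk delete page size: " + pageSize);

      // Delete one page of objects; the collection must not exceed pageSize
      // entries (assumed >= 2 here, e.g. an S3A store).
      List<Map.Entry<Path, String>> failures = bulkDelete.bulkDelete(
          Arrays.asList(
              new Path(base, "data/file-0001.parquet"),
              new Path(base, "data/file-0002.parquet")));

      // Each entry is (path, error text); missing files are not reported.
      for (Map.Entry<Path, String> failure : failures) {
        System.err.println("failed to delete " + failure.getKey() + ": " + failure.getValue());
      }
    }
  }
}
```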

Support this in Iceberg:

  • Add a new option iceberg.hadoop.bulk.delete.enabled (default: false)
  • Use reflection to invoke the bulk delete API through the reflection-friendly
    org.apache.hadoop.io.wrappedio.WrappedIO class (see the sketch after this list)
  • Switch to the bulk delete mechanism if it is enabled and present
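
A rough sketch of that reflection binding, using plain java.lang.reflect for
illustration (an actual patch would more likely use Iceberg's DynMethods helpers);
bulkDelete_pageSize and bulkDelete_delete are the Hadoop 3.4.1 WrappedIO entry
points, while BulkDeleteBinding is a hypothetical class name:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical binding to the reflection-friendly WrappedIO entry points. */
final class BulkDeleteBinding {
  private static final String WRAPPED_IO = "org.apache.hadoop.io.wrappedio.WrappedIO";

  private final Method pageSizeMethod;  // static int bulkDelete_pageSize(FileSystem, Path)
  private final Method deleteMethod;    // static List<Map.Entry<Path, String>>
                                        //   bulkDelete_delete(FileSystem, Path, Collection<Path>)

  private BulkDeleteBinding(Method pageSizeMethod, Method deleteMethod) {
    this.pageSizeMethod = pageSizeMethod;
    this.deleteMethod = deleteMethod;
  }

  /** Returns a binding, or null when running against a Hadoop release without the API. */
  static BulkDeleteBinding loadIfAvailable() {
    try {
      Class<?> wrappedIO = Class.forName(WRAPPED_IO);
      return new BulkDeleteBinding(
          wrappedIO.getMethod("bulkDelete_pageSize", FileSystem.class, Path.class),
          wrappedIO.getMethod("bulkDelete_delete", FileSystem.class, Path.class, Collection.class));
    } catch (ReflectiveOperationException e) {
      return null; // caller falls back to fs.delete(path, false)
    }
  }

  int pageSize(FileSystem fs, Path base) {
    return (Integer) invoke(pageSizeMethod, fs, base);
  }

  @SuppressWarnings("unchecked")
  List<Map.Entry<Path, String>> delete(FileSystem fs, Path base, Collection<Path> paths) {
    return (List<Map.Entry<Path, String>>) invoke(deleteMethod, fs, base, paths);
  }

  private static Object invoke(Method method, Object... args) {
    try {
      return method.invoke(null, args);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InvocationTargetException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw new UncheckedIOException((IOException) cause);
      }
      throw new RuntimeException(cause);
    }
  }
}
```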

If active, bulk delete is performed as follows:

  1. Build up a page of paths to delete for every target filesystem.
  2. Initiate an asynchronous bulk delete request whenever a page is full.
  3. When the end of the list has been reached, queue page deletes for
     all incomplete pages.
  4. Await the results and report any failures.

Missing files are not reported as failures; they are not even detected.
Failures will stem from permissions, network problems and possibly transient endpoint issues.
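
A sketch of that flow, assuming the hypothetical BulkDeleteBinding from the
previous snippet and a caller-supplied ExecutorService; failure handling is
reduced to collecting the (path, error) entries the store reports:

```java
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of the paged, asynchronous delete flow. */
final class PagedBulkDeleter {
  private final BulkDeleteBinding binding;  // from the earlier sketch
  private final Configuration conf;
  private final ExecutorService executor;

  PagedBulkDeleter(BulkDeleteBinding binding, Configuration conf, ExecutorService executor) {
    this.binding = binding;
    this.conf = conf;
    this.executor = executor;
  }

  /** Deletes all locations; returns the failures reported by the stores. */
  List<Map.Entry<Path, String>> deleteAll(Iterable<String> locations) throws IOException {
    Map<URI, List<Path>> currentPages = new HashMap<>();
    List<CompletableFuture<List<Map.Entry<Path, String>>>> pending = new ArrayList<>();

    // Steps 1 + 2: build a page per target filesystem, submitting each page as it fills.
    for (String location : locations) {
      Path path = new Path(location);
      FileSystem fs = path.getFileSystem(conf);
      Path root = fs.makeQualified(new Path("/"));
      List<Path> page = currentPages.computeIfAbsent(fs.getUri(), uri -> new ArrayList<>());
      page.add(path);
      if (page.size() >= binding.pageSize(fs, root)) {
        pending.add(submit(fs, root, new ArrayList<>(page)));
        page.clear();
      }
    }

    // Step 3: queue deletes for all incomplete pages.
    for (Map.Entry<URI, List<Path>> entry : currentPages.entrySet()) {
      if (!entry.getValue().isEmpty()) {
        FileSystem fs = FileSystem.get(entry.getKey(), conf);
        pending.add(submit(fs, fs.makeQualified(new Path("/")), entry.getValue()));
      }
    }

    // Step 4: await the results and collect failures (join() rethrows delete exceptions).
    List<Map.Entry<Path, String>> failures = new ArrayList<>();
    for (CompletableFuture<List<Map.Entry<Path, String>>> future : pending) {
      failures.addAll(future.join());
    }
    return failures;
  }

  private CompletableFuture<List<Map.Entry<Path, String>>> submit(
      FileSystem fs, Path base, List<Path> page) {
    return CompletableFuture.supplyAsync(() -> binding.delete(fs, base, page), executor);
  }
}
```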

Add a parameterized test to verify that bulk delete works.
This needs to be run against Hadoop 3.4.1 to actually verify coverage.
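
A sketch of what such a test could look like (JUnit 5 / AssertJ); the
iceberg.hadoop.bulk.delete.enabled key is the option proposed above, and the
test assumes it can be passed through the Hadoop Configuration and exercised via
HadoopFileIO.deleteFiles():

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.junit.jupiter.api.io.TempDir;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class TestHadoopFileIOBulkDelete {

  @TempDir File tempDir;

  @ParameterizedTest(name = "bulkDeleteEnabled={0}")
  @ValueSource(booleans = {false, true})
  void deleteFiles(boolean bulkDeleteEnabled) throws IOException {
    Configuration conf = new Configuration();
    // Proposed option; exact wiring (Hadoop conf vs. FileIO properties) to be decided.
    conf.setBoolean("iceberg.hadoop.bulk.delete.enabled", bulkDeleteEnabled);
    HadoopFileIO fileIO = new HadoopFileIO(conf);

    // Create a handful of files through the FileIO, then bulk delete them.
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      String path = new File(tempDir, "file-" + i).toURI().toString();
      fileIO.newOutputFile(path).createOrOverwrite().close();
      paths.add(path);
    }

    fileIO.deleteFiles(paths);

    for (String path : paths) {
      assertThat(fileIO.newInputFile(path).exists()).isFalse();
    }
  }
}
```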

Testing this feature all the way to S3 is complicated.
A test within the hadoop-aws module can validate the feature through HadoopFileIO
and act as a regression test for the S3A connector.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time