HadoopFileIO to support bulk delete through the Hadoop Filesystem APIs #12055

steveloughran opened this issue Jan 22, 2025 · 0 comments

Feature Request / Improvement

Hadoop filesystems now support a paged bulk delete API.

For most filesystems the page size is 1; a bulk delete simply maps to a single file delete.

For S3A, the page size is the value of fs.s3a.bulk.delete.page.size; each page of
deletions is executed as a single bulk delete POST in the AWS API.

No attempt is made to implement the POSIX "safety checks" of classic delete calls,
such as probing for the path being a directory or for the parent directory
existing afterwards.

As such it is the most efficient way to delete many objects; its performance
should match that of S3FileIO.deleteFiles().
For filesystems without a bulk delete implementation, each deletion is mapped to
delete(path), so it is no less efficient than normal delete calls.
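
For reference, a minimal sketch of using the Hadoop 3.4.1 API directly
(org.apache.hadoop.fs.BulkDelete, obtained through FileSystem.createBulkDelete());
bucket and file names are purely illustrative:

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkDeleteExample {
  public static void main(String[] args) throws IOException {
    // All paths handed to a BulkDelete instance must live under its base path.
    Path base = new Path("s3a://example-bucket/warehouse/db/table");
    FileSystem fs = base.getFileSystem(new Configuration());

    try (BulkDelete bulkDelete = fs.createBulkDelete(base)) {
      // 1 for most filesystems; fs.s3a.bulk.delete.page.size for S3A.
      int pageSize = bulkDelete.pageSize();
      System.out.println("bulk delete page size: " + pageSize);

      // Delete one page of objects; the collection must not exceed pageSize
      // entries (assumed >= 2 here, e.g. an S3A store).
      List<Map.Entry<Path, String>> failures = bulkDelete.bulkDelete(
          Arrays.asList(
              new Path(base, "data/file-0001.parquet"),
              new Path(base, "data/file-0002.parquet")));

      // Each entry is (path, error text); missing files are not reported.
      for (Map.Entry<Path, String> failure : failures) {
        System.err.println("failed to delete " + failure.getKey() + ": " + failure.getValue());
      }
    }
  }
}
```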

Support this in Iceberg:

  • Add a new option iceberg.hadoop.bulk.delete.enabled (default: false)
  • Use reflection to invoke the bulk delete API through the reflection-friendly
    org.apache.hadoop.io.wrappedio.WrappedIO class (see the sketch after this list)
  • Switch to the bulk delete mechanism if it is enabled and present
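
A rough sketch of that reflection binding, using plain java.lang.reflect for
illustration (an actual patch would more likely use Iceberg's DynMethods helpers);
bulkDelete_pageSize and bulkDelete_delete are the Hadoop 3.4.1 WrappedIO entry
points, while BulkDeleteBinding is a hypothetical class name:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical binding to the reflection-friendly WrappedIO entry points. */
final class BulkDeleteBinding {
  private static final String WRAPPED_IO = "org.apache.hadoop.io.wrappedio.WrappedIO";

  private final Method pageSizeMethod;  // static int bulkDelete_pageSize(FileSystem, Path)
  private final Method deleteMethod;    // static List<Map.Entry<Path, String>>
                                        //   bulkDelete_delete(FileSystem, Path, Collection<Path>)

  private BulkDeleteBinding(Method pageSizeMethod, Method deleteMethod) {
    this.pageSizeMethod = pageSizeMethod;
    this.deleteMethod = deleteMethod;
  }

  /** Returns a binding, or null when running against a Hadoop release without the API. */
  static BulkDeleteBinding loadIfAvailable() {
    try {
      Class<?> wrappedIO = Class.forName(WRAPPED_IO);
      return new BulkDeleteBinding(
          wrappedIO.getMethod("bulkDelete_pageSize", FileSystem.class, Path.class),
          wrappedIO.getMethod("bulkDelete_delete", FileSystem.class, Path.class, Collection.class));
    } catch (ReflectiveOperationException e) {
      return null; // caller falls back to fs.delete(path, false)
    }
  }

  int pageSize(FileSystem fs, Path base) {
    return (Integer) invoke(pageSizeMethod, fs, base);
  }

  @SuppressWarnings("unchecked")
  List<Map.Entry<Path, String>> delete(FileSystem fs, Path base, Collection<Path> paths) {
    return (List<Map.Entry<Path, String>>) invoke(deleteMethod, fs, base, paths);
  }

  private static Object invoke(Method method, Object... args) {
    try {
      return method.invoke(null, args);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InvocationTargetException e) {
      Throwable cause = e.getCause();
      if (cause instanceof IOException) {
        throw new UncheckedIOException((IOException) cause);
      }
      throw new RuntimeException(cause);
    }
  }
}
```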

If active, bulk delete is performed as follows:

  1. Build up a page of paths to delete for every target filesystem.
  2. Initiate an asynchronous bulk delete request whenever a page is full.
  3. When the end of the list has been reached, queue page deletes for
     all incomplete pages.
  4. Await the results and report any failures.

Missing files are not reported as failures; they are not even detected.
Failures will stem from permissions, network problems and possibly transient endpoint issues.
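
A sketch of that flow, assuming the hypothetical BulkDeleteBinding from the
previous snippet and a caller-supplied ExecutorService; failure handling is
reduced to collecting the (path, error) entries the store reports:

```java
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of the paged, asynchronous delete flow. */
final class PagedBulkDeleter {
  private final BulkDeleteBinding binding;  // from the earlier sketch
  private final Configuration conf;
  private final ExecutorService executor;

  PagedBulkDeleter(BulkDeleteBinding binding, Configuration conf, ExecutorService executor) {
    this.binding = binding;
    this.conf = conf;
    this.executor = executor;
  }

  /** Deletes all locations; returns the failures reported by the stores. */
  List<Map.Entry<Path, String>> deleteAll(Iterable<String> locations) throws IOException {
    Map<URI, List<Path>> currentPages = new HashMap<>();
    List<CompletableFuture<List<Map.Entry<Path, String>>>> pending = new ArrayList<>();

    // Steps 1 + 2: build a page per target filesystem, submitting each page as it fills.
    for (String location : locations) {
      Path path = new Path(location);
      FileSystem fs = path.getFileSystem(conf);
      Path root = fs.makeQualified(new Path("/"));
      List<Path> page = currentPages.computeIfAbsent(fs.getUri(), uri -> new ArrayList<>());
      page.add(path);
      if (page.size() >= binding.pageSize(fs, root)) {
        pending.add(submit(fs, root, new ArrayList<>(page)));
        page.clear();
      }
    }

    // Step 3: queue deletes for all incomplete pages.
    for (Map.Entry<URI, List<Path>> entry : currentPages.entrySet()) {
      if (!entry.getValue().isEmpty()) {
        FileSystem fs = FileSystem.get(entry.getKey(), conf);
        pending.add(submit(fs, fs.makeQualified(new Path("/")), entry.getValue()));
      }
    }

    // Step 4: await the results and collect failures (join() rethrows delete exceptions).
    List<Map.Entry<Path, String>> failures = new ArrayList<>();
    for (CompletableFuture<List<Map.Entry<Path, String>>> future : pending) {
      failures.addAll(future.join());
    }
    return failures;
  }

  private CompletableFuture<List<Map.Entry<Path, String>>> submit(
      FileSystem fs, Path base, List<Path> page) {
    return CompletableFuture.supplyAsync(() -> binding.delete(fs, base, page), executor);
  }
}
```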

Add a parameterized test to verify that bulk delete works.
This needs to be run against Hadoop 3.4.1 to actually verify coverage.
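
A sketch of what such a test could look like (JUnit 5 / AssertJ); the
iceberg.hadoop.bulk.delete.enabled key is the option proposed above, and the
test assumes it can be passed through the Hadoop Configuration and exercised via
HadoopFileIO.deleteFiles():

```java
import static org.assertj.core.api.Assertions.assertThat;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.junit.jupiter.api.io.TempDir;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class TestHadoopFileIOBulkDelete {

  @TempDir File tempDir;

  @ParameterizedTest(name = "bulkDeleteEnabled={0}")
  @ValueSource(booleans = {false, true})
  void deleteFiles(boolean bulkDeleteEnabled) throws IOException {
    Configuration conf = new Configuration();
    // Proposed option; exact wiring (Hadoop conf vs. FileIO properties) to be decided.
    conf.setBoolean("iceberg.hadoop.bulk.delete.enabled", bulkDeleteEnabled);
    HadoopFileIO fileIO = new HadoopFileIO(conf);

    // Create a handful of files through the FileIO, then bulk delete them.
    List<String> paths = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      String path = new File(tempDir, "file-" + i).toURI().toString();
      fileIO.newOutputFile(path).createOrOverwrite().close();
      paths.add(path);
    }

    fileIO.deleteFiles(paths);

    for (String path : paths) {
      assertThat(fileIO.newInputFile(path).exists()).isFalse();
    }
  }
}
```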

Testing this feature all the way to S3 is complicated.
A test within the hadoop-aws module can validate the feature through HadoopFileIO
and act as a regression test for the S3A connector.

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time