Feature Request / Improvement
Hadoop filesystems now support a paged bulk delete API.
For most filesystems the page size is 1; a bulk delete simply maps to a single file delete.
For S3A, the page size is the value of fs.s3a.bulk.delete.page.size; each page of deletions is executed as a single bulk delete POST in the AWS API.
No attempt is made to implement POSIX "safety checks", such as verifying that the path is not a directory or that the parent directory still exists afterwards.
As such, it is the most efficient way to delete many objects; its performance should match that of S3FileIO.deleteFiles().
For filesystems without bulk delete, each delete is mapped to delete(path), so it is no less efficient than normal delete calls.
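For context, here is a minimal sketch of how the underlying Hadoop 3.4.x API is used directly, assuming the `BulkDelete`/`createBulkDelete` names from HADOOP-18679; the helper method itself is invented for illustration:

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.BulkDelete;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class BulkDeleteExample {
  /** Delete a list of objects under {@code base}, one page at a time. */
  static void bulkDeleteAll(FileSystem fs, Path base, List<Path> paths) throws IOException {
    try (BulkDelete op = fs.createBulkDelete(base)) {
      int pageSize = op.pageSize(); // 1 for most filesystems; fs.s3a.bulk.delete.page.size for S3A
      for (int start = 0; start < paths.size(); start += pageSize) {
        List<Path> page = paths.subList(start, Math.min(start + pageSize, paths.size()));
        // one page == one file delete (page size 1) or one bulk delete POST (S3A)
        List<Map.Entry<Path, String>> failures = op.bulkDelete(page);
        if (!failures.isEmpty()) {
          Map.Entry<Path, String> first = failures.get(0);
          throw new IOException("Failed to delete " + first.getKey() + ": " + first.getValue());
        }
      }
    }
  }
}
```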
Support this in Iceberg:
Add a new option iceberg.hadoop.bulk.delete.enabled (default: false).
Invoke the bulk delete API through the reflection-friendly org.apache.hadoop.io.wrappedio.WrappedIO class (see the sketch after this list).
Switch to the bulk delete mechanism if it is enabled and present.
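A possible shape for that reflection layer, assuming the `bulkDelete_pageSize` / `bulkDelete_delete` static method names published in Hadoop 3.4.1's WrappedIO; the bridge class itself is hypothetical, and a real patch would likely use Iceberg's DynMethods helper rather than raw JDK reflection:

```java
import java.lang.reflect.Method;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical reflection bridge to org.apache.hadoop.io.wrappedio.WrappedIO so the
 * code still loads and runs against Hadoop releases that predate the bulk delete API.
 */
final class WrappedIOBridge {
  private static final Method PAGE_SIZE = load("bulkDelete_pageSize", FileSystem.class, Path.class);
  private static final Method DELETE = load("bulkDelete_delete", FileSystem.class, Path.class, Collection.class);

  private static Method load(String name, Class<?>... argTypes) {
    try {
      return Class.forName("org.apache.hadoop.io.wrappedio.WrappedIO").getMethod(name, argTypes);
    } catch (ReflectiveOperationException e) {
      return null; // older Hadoop release: fall back to delete(path)
    }
  }

  static boolean bulkDeleteAvailable() {
    return PAGE_SIZE != null && DELETE != null;
  }

  static int pageSize(FileSystem fs, Path base) throws ReflectiveOperationException {
    return (Integer) PAGE_SIZE.invoke(null, fs, base);
  }

  @SuppressWarnings("unchecked")
  static List<Map.Entry<Path, String>> deletePage(FileSystem fs, Path base, Collection<Path> page)
      throws ReflectiveOperationException {
    return (List<Map.Entry<Path, String>>) DELETE.invoke(null, fs, base, page);
  }

  private WrappedIOBridge() {}
}
```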
If active, bulk delete works as follows (sketched below):
Build up a page of paths to delete for each target filesystem.
Initiate an asynchronous bulk delete request whenever a page is full.
When the end of the list has been reached, queue page deletes for all incomplete pages.
Await the results and report failures as such.
Missing files are not reported as failures; they are not detected.
Failures will come from permissions, network problems and possibly transient endpoint issues.
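A sketch of that paging logic, with class and method names invented for illustration; the real change would live inside HadoopFileIO's bulk delete path and delegate to the WrappedIO bridge above:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative paging logic only: group paths by filesystem, submit an asynchronous
 * bulk delete whenever a page fills, flush partial pages at the end, then await all
 * futures and report any per-path failures.
 */
class PagedBulkDeleter {
  private final Configuration conf;
  private final ExecutorService pool;
  private final Map<FileSystem, List<Path>> pages = new HashMap<>();
  private final List<CompletableFuture<List<Map.Entry<Path, String>>>> pending = new ArrayList<>();

  PagedBulkDeleter(Configuration conf, ExecutorService pool) {
    this.conf = conf;
    this.pool = pool;
  }

  void deleteAll(Iterable<String> locations) throws IOException {
    for (String location : locations) {
      Path path = new Path(location);
      FileSystem fs = path.getFileSystem(conf); // FileSystem instances are cached, so they work as map keys
      List<Path> page = pages.computeIfAbsent(fs, ignored -> new ArrayList<>());
      page.add(path);
      if (page.size() >= pageSize(fs, path)) {
        submit(fs, page); // page full: issue one asynchronous bulk delete
        pages.put(fs, new ArrayList<>());
      }
    }
    // end of the list: queue deletes for all incomplete pages
    pages.forEach((fs, page) -> {
      if (!page.isEmpty()) {
        submit(fs, page);
      }
    });
    // await results; collect failures (permissions, network, transient endpoint issues)
    List<String> failures = new ArrayList<>();
    for (CompletableFuture<List<Map.Entry<Path, String>>> future : pending) {
      future.join().forEach(entry -> failures.add(entry.getKey() + ": " + entry.getValue()));
    }
    if (!failures.isEmpty()) {
      throw new IOException("Failed to delete " + failures.size() + " files: " + failures);
    }
  }

  private void submit(FileSystem fs, Collection<Path> page) {
    pending.add(CompletableFuture.supplyAsync(() -> deletePage(fs, page), pool));
  }

  // sketch only: would delegate to WrappedIOBridge.deletePage, or delete(path) when unavailable
  private List<Map.Entry<Path, String>> deletePage(FileSystem fs, Collection<Path> page) {
    throw new UnsupportedOperationException("sketch only");
  }

  // sketch only: would query WrappedIOBridge.pageSize(fs, path) when the API is available
  private int pageSize(FileSystem fs, Path path) {
    return 1;
  }
}
```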
Adds a parameterized test to verify that bulk delete works.
This needs to be run against Hadoop 3.4.1 to actually verify coverage.
Testing this feature all the way to S3 is complicated.
A test within the hadoop-aws module can validate
the feature through HadoopFileIO and act as regression
testing for the S3A Connector.
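A rough outline of what the parameterized test might look like, assuming HadoopFileIO exposes SupportsBulkOperations.deleteFiles() and honours the proposed iceberg.hadoop.bulk.delete.enabled option; the test class and file layout are placeholders:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.hadoop.HadoopFileIO;
import org.apache.iceberg.io.PositionOutputStream;
import org.junit.jupiter.api.io.TempDir;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.assertj.core.api.Assertions.assertThat;

class TestBulkDelete {

  // Run the same delete round-trip with the proposed option off and on; against
  // Hadoop < 3.4.1 the enabled case is expected to fall back to per-file delete.
  @ParameterizedTest
  @ValueSource(booleans = {false, true})
  void deleteFilesRemovesAllPaths(boolean bulkDeleteEnabled, @TempDir File tmp) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("iceberg.hadoop.bulk.delete.enabled", bulkDeleteEnabled); // proposed option
    HadoopFileIO io = new HadoopFileIO(conf);

    List<String> locations = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      String location = "file:" + new File(tmp, "data-" + i + ".bin").getAbsolutePath();
      try (PositionOutputStream out = io.newOutputFile(location).create()) {
        out.write(new byte[] {0, 1, 2});
      }
      locations.add(location);
    }

    io.deleteFiles(locations); // SupportsBulkOperations entry point

    for (String location : locations) {
      assertThat(io.newInputFile(location).exists()).isFalse();
    }
  }
}
```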
Query engine
None
Willingness to contribute
I can contribute this improvement/feature independently
I would be willing to contribute this improvement/feature with guidance from the Iceberg community
I cannot contribute this improvement/feature at this time