Conversation

@ismailsimsek
Contributor

@ismailsimsek ismailsimsek commented Jan 4, 2025

Continuing #7914

  • Added a pathFilter (PartitionAwareHiddenPathFilter) to the new listWithPrefix method
  • Added tests verifying that the new method and the previous one return the same values
  • Fixed the failing test testHiddenPathsStartingWithPartitionNamesAreIgnored

With this change, current executions now use HadoopFileIO, which implements DelegateFileIO and SupportsPrefixOperations. This results in calls to the new listWithPrefix method.
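A minimal sketch (hypothetical helper name, simplified from the actual action) of the dispatch this enables: when the table's FileIO supports prefix operations, the listing becomes a single flat listPrefix scan instead of a recursive directory walk:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical helper; the real action wires this logic into listWithPrefix().
    static List<String> listFilesUnder(FileIO io, String location) {
      List<String> matchingFiles = new ArrayList<>();
      if (io instanceof SupportsPrefixOperations) {
        // One flat listing of every object under the location prefix.
        for (FileInfo file : ((SupportsPrefixOperations) io).listPrefix(location)) {
          matchingFiles.add(file.location());
        }
      }
      return matchingFiles;
    }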

@github-actions github-actions bot added the spark label Jan 4, 2025
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 0846141 to 6267e48 on January 4, 2025 19:16
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 6267e48 to a191684 on January 5, 2025 11:59
@ismailsimsek
Contributor Author

cc @flyrain @RussellSpitzer @rahil-c: it's ready for review and a test has been added. I'd also appreciate any suggestions on the failing test.

@RussellSpitzer
Member

RussellSpitzer commented Jan 7, 2025

The test here says it's failing because you are deleting

    but the following elements were unexpected:
      ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]

when those are not files that are traditionally removed by remove orphan files (which only removes files with certain prefixes). This is a behavior change, and I don't think it's actually beneficial, so is there any way to fix this?

    TestRemoveOrphanFilesAction3 > testHiddenPathsStartingWithPartitionNamesAreIgnored() > formatVersion = 3 FAILED
    java.lang.AssertionError: [same as] 
    Expecting actual:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    to contain exactly in any order:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    but the following elements were unexpected:
      ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]

    .isTrue();

    DeleteOrphanFiles.Result result3 =
    DeleteOrphanFilesSparkAction action3 =
Member

While we are modifying things here, please rename all the name# variables to something relevant to the test. The names should reflect what we are checking.

Contributor Author

renamed to more descriptive names

@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from bdd982c to 2212f2a on January 8, 2025 12:34
Comment on lines +648 to +674
/**
 * Returns true if the path itself or any of its parent folders is hidden,
 * i.e. rejected by doAccept(Path).
 */
public boolean hasHiddenPttParentFolder(Path path) {
  // Walk up the parent chain; the stream starts at the path itself.
  return Stream.iterate(path, Path::getParent)
      .takeWhile(Objects::nonNull)
      .anyMatch(parentPath -> !doAccept(parentPath));
}
Contributor Author

@ismailsimsek ismailsimsek Jan 8, 2025

Now it checks the parent folders of each file, to ensure that none of them is a hidden partition folder. This might be less performant for large listings, if performance is a concern.
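As a standalone illustration of the parent walk (a simplified name-based predicate stands in for the filter's real doAccept logic, and the paths are hypothetical):

    import java.util.Objects;
    import java.util.stream.Stream;
    import org.apache.hadoop.fs.Path;

    // True when the path or any ancestor folder looks hidden: it starts with
    // '_' or '.' and is not a partition folder such as "_c2_trunc=AA".
    static boolean hasHiddenParent(Path path) {
      return Stream.iterate(path, Path::getParent)
          .takeWhile(Objects::nonNull)
          .anyMatch(
              p -> {
                String name = p.getName();
                return (name.startsWith("_") || name.startsWith(".")) && !name.contains("=");
              });
    }

    // hasHiddenParent(new Path("/table/data/_c2_trunc/subfolder/file.txt"))   -> true
    // hasHiddenParent(new Path("/table/data/_c2_trunc=AA/c3=AAAA/f.parquet")) -> false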

Comment on lines 321 to 318
// NOTE: check the path relative to the table location, to avoid checking
// unnecessary root folders
Path relativeFilePath = new Path(fileInfo.location().replace(location, ""));
Contributor Author

Creating a relative path to avoid checking the table's parent folders. However, this replace(location, "") might not be the best solution; open to any ideas.
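One hedged alternative sketch: strip the location only when it is a true prefix of the file location, so replace() cannot also rewrite an occurrence of the string elsewhere in the path (the empty-result edge case is ignored here):

    String fileLocation = fileInfo.location();
    // Only strip the table location from the beginning of the path.
    String relative =
        fileLocation.startsWith(location)
            ? fileLocation.substring(location.length())
            : fileLocation;
    Path relativeFilePath = new Path(relative);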

@danielcweeks
Contributor

@ismailsimsek my issue with this PR is the same as with the previous PR. This isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not.

I think we need a way to break up the prefixes appropriately so that we can distribute the listing.

@danielcweeks danielcweeks self-requested a review January 15, 2025 19:13
}

@VisibleForTesting
Dataset<String> listWithPrefix() {
Contributor

We need a way to break up the key space, possibly by taking hints from what LocationProvider is configured for the table. A single listing is not scalable.

Contributor Author

@ismailsimsek ismailsimsek Jan 17, 2025

@danielcweeks I'm having trouble figuring out how to get the sub-folders and list them separately using the existing classes.

Does it make sense to add a new method to the SupportsPrefixOperations interface?

Something like below; with the cloud client libraries, the delimiter (/) parameter could be used to implement it:

    Iterable<FileInfo> listPrefix(String prefix, boolean currentDirectory);

So the idea is to do it in two steps (sketched below):
1. First make a call to get the sub-directories (using the new method, delimiter = "/").
2. Then do a full listing per prefix + sub-directory.
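A hypothetical sketch of those two steps, assuming the proposed listPrefix(String, boolean) overload existed (it does not today) and given some io and dataLocation in scope:

    List<String> matchingFiles = new ArrayList<>();
    // Step 1 (proposed API): delimiter-style call returning only the
    // immediate sub-directory prefixes under the data location.
    for (FileInfo subDir : io.listPrefix(dataLocation, true)) {
      // Step 2: full recursive listing per sub-directory; each of these
      // listings could run as a separate task.
      for (FileInfo file : io.listPrefix(subDir.location(), false)) {
        matchingFiles.add(file.location());
      }
    }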

Contributor

We don't want to add a delimiter if at all possible. I think the right approach is to enumerate the key space of the first character (or the first few characters) and then distribute the key space for executors to process as tasks.

Depending on the layout strategy, this could be different, but it is generally predictable.

Contributor

For example, with the new object store layout, you would be able to use a binary representation for the first three characters (one task for each prefix from 000 to 111), then have separate tasks for anything above and below the binary characters, and distribute that processing.
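For instance, a small sketch of enumerating those three-character binary prefixes as listing tasks (the extra tasks for keys sorting before "000" or after "111" are left out):

    import java.util.ArrayList;
    import java.util.List;

    // One listing task per binary prefix, "000" through "111".
    List<String> taskPrefixes = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      String bits = Integer.toBinaryString(i);
      taskPrefixes.add("000".substring(bits.length()) + bits); // left-pad with '0'
    }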

@RussellSpitzer
Member

@ismailsimsek my issue with this PR is the same as with the previous PR. This isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not.

I think we need a way to break up the prefixes appropriately so that we can distribute the listing.

Do we have some technical docs on the performance of the listPrefix approach? I tried to look this up but couldn't find anything other than some old Stack Overflow posts saying it worked on 80 million entries in someone's workflow. I just want to make sure we aren't parallelizing something on the client that isn't already parallelized on the server.

I think @danielcweeks is right that the naive approach here could be very dangerous if the server implementation of list prefix was not internally distributed.

@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from b4f5ede to 5c9b49c on January 24, 2025 20:07
Comment on lines +339 to +359
if (table.locationProvider() instanceof LocationProviders.ObjectStoreLocationProvider) {
  // ObjectStoreLocationProvider generates hierarchical prefixes in a binary fashion
  // (0000/, 0001/, 0010/, 0011/, ...).
  // This allows us to parallelize listing operations across these prefixes.
  List<String> prefixes =
      List.of(
          "/0000", "/0001", "/0010", "/0011", "/0100", "/0101", "/0110", "/0111",
          "/1000", "/1001", "/1010", "/1011", "/1100", "/1101", "/1110", "/1111");

  String tableDataLocationRoot = table.locationProvider().dataLocationRoot();
  for (String prefix : prefixes) {
    List<String> result = listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter);
    matchingFiles.addAll(result);
  }
} else {
  matchingFiles.addAll(listLocationWithPrefix(location, pathFilter));
}

JavaRDD<String> matchingFileRDD = sparkContext().parallelize(matchingFiles, 1);
return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());
Contributor Author

@ismailsimsek ismailsimsek Jan 25, 2025

@danielcweeks @RussellSpitzer at a high level, does this reflect the idea of multiple listings? Do we need to use Spark instead of the loop to parallelize it? A sketch of that follows below.
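One way the driver-side loop could be pushed onto executors (a sketch only; it assumes listLocationWithPrefix and pathFilter are serializable or otherwise reachable from within a Spark task, which the current code may not guarantee):

    // Distribute one listing task per prefix instead of looping on the driver.
    JavaRDD<String> prefixRDD = sparkContext().parallelize(prefixes, prefixes.size());
    JavaRDD<String> matchingFileRDD =
        prefixRDD.flatMap(
            prefix ->
                listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter).iterator());
    return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());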

A few blocking limitations (please correct me if any of this is wrong):

  • listPrefix requires an exact prefix; a regexp cannot be used.
  • listPrefix requires an exact folder as the prefix; the first few characters of a folder name cannot be used as the listing prefix.
    • In both scenarios above, when the prefix does not exist, HadoopFileIO throws java.io.FileNotFoundException for the given prefix (I expect other FileIO implementations behave the same; see the sketch after this list).
  • listPrefix cannot be used to list prefixes sorting below xxxx/0000 or above xxxx/1111.
    • So only prefixes with these numeric hashes are listed; it's not possible to list other folders under the data location, since we cannot determine their names.
  • Had to extend LocationProvider to expose dataLocationRoot so that we can use it to list an exact prefix with the hashes.
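On the FileNotFoundException point, a hedged sketch of guarding each per-prefix call (assuming the missing prefix surfaces as an UncheckedIOException wrapping FileNotFoundException; other FileIO implementations may differ):

    import java.io.FileNotFoundException;
    import java.io.UncheckedIOException;

    try {
      matchingFiles.addAll(listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter));
    } catch (UncheckedIOException e) {
      if (!(e.getCause() instanceof FileNotFoundException)) {
        throw e;
      }
      // This binary prefix simply does not exist for the table yet; skip it.
    }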

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 25, 2025
@github-actions

github-actions bot commented Mar 4, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Mar 4, 2025