Conversation

@ismailsimsek
Contributor

@ismailsimsek ismailsimsek commented Jan 4, 2025

Continuing #7914

  • Added a pathFilter (PartitionAwareHiddenPathFilter) to the new listWithPrefix method
  • Added tests verifying that the new method and the previous one return the same values
  • Fixed the failing test testHiddenPathsStartingWithPartitionNamesAreIgnored

With this change, current executions now use HadoopFileIO, which implements DelegateFileIO and SupportsPrefixOperations. This results in calls to the new listWithPrefix method.
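A minimal sketch (hypothetical helper name, simplified from the actual action) of the dispatch this enables: when the table's FileIO supports prefix operations, the listing becomes a single flat listPrefix scan instead of a recursive directory walk:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.io.FileInfo;
    import org.apache.iceberg.io.SupportsPrefixOperations;

    // Hypothetical helper; the real action wires this logic into listWithPrefix().
    static List<String> listFilesUnder(FileIO io, String location) {
      List<String> matchingFiles = new ArrayList<>();
      if (io instanceof SupportsPrefixOperations) {
        // One flat listing of every object under the location prefix.
        for (FileInfo file : ((SupportsPrefixOperations) io).listPrefix(location)) {
          matchingFiles.add(file.location());
        }
      }
      return matchingFiles;
    }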

@github-actions github-actions bot added the spark label Jan 4, 2025
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 0846141 to 6267e48 on January 4, 2025 19:16
@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from 6267e48 to a191684 on January 5, 2025 11:59
@ismailsimsek
Contributor Author

cc @flyrain @RussellSpitzer @rahil-c: it's ready for review and a test has been added. I'd also appreciate any suggestions on the failing test.

@RussellSpitzer
Member

RussellSpitzer commented Jan 7, 2025

The test here says it's failing because you are deleting

    but the following elements were unexpected:
      ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]

when those are not files that are traditionally removed by remove orphan files (which only removes files with certain prefixes). This is a behavior change, and I don't think it's actually beneficial, so is there any way to fix this?

    TestRemoveOrphanFilesAction3 > testHiddenPathsStartingWithPartitionNamesAreIgnored() > formatVersion = 3 FAILED
    java.lang.AssertionError: [same as] 
    Expecting actual:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    to contain exactly in any order:
      ["file:/tmp/junit-14563533605645158466/metadata/v1.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/v2.metadata.json",
        "file:/tmp/junit-14563533605645158466/metadata/2607cb5e-4b09-49d3-8887-52a5341aaaa9-m0.avro",
        "file:/tmp/junit-14563533605645158466/metadata/version-hint.text",
        "file:/tmp/junit-14563533605645158466/metadata/snap-1334501503484114515-1-2607cb5e-4b09-49d3-8887-52a5341aaaa9.avro",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc=AA/c3=AAAA/00000-864-80ba5c39-c3da-4be3-9783-5d8666c89ccc-0-00001.parquet"]
    but the following elements were unexpected:
      ["file:/tmp/junit-14563533605645158466/data/_c2_trunc/subfolder/file.txt",
        "file:/tmp/junit-14563533605645158466/data/_c2_trunc/file.txt"]

    .isTrue();

    DeleteOrphanFiles.Result result3 =
    DeleteOrphanFilesSparkAction action3 =
Member

While we are modifying things here, please rename all the name# variables to something relevant to the test. The names should reflect what we are checking.

Contributor Author

renamed to more descriptive names

@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from bdd982c to 2212f2a on January 8, 2025 12:34
Comment on lines +648 to +674
/**
 * Returns true if the path itself or any of its parent folders is hidden,
 * i.e. rejected by doAccept(Path).
 */
public boolean hasHiddenPttParentFolder(Path path) {
  // Walk up the parent chain; the stream starts at the path itself.
  return Stream.iterate(path, Path::getParent)
      .takeWhile(Objects::nonNull)
      .anyMatch(parentPath -> !doAccept(parentPath));
}
Contributor Author

@ismailsimsek ismailsimsek Jan 8, 2025

Now it checks the parent folders of each file, to ensure that none of them is a hidden partition folder. This might be less performant for large listings, if performance is a concern.
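As a standalone illustration of the parent walk (a simplified name-based predicate stands in for the filter's real doAccept logic, and the paths are hypothetical):

    import java.util.Objects;
    import java.util.stream.Stream;
    import org.apache.hadoop.fs.Path;

    // True when the path or any ancestor folder looks hidden: it starts with
    // '_' or '.' and is not a partition folder such as "_c2_trunc=AA".
    static boolean hasHiddenParent(Path path) {
      return Stream.iterate(path, Path::getParent)
          .takeWhile(Objects::nonNull)
          .anyMatch(
              p -> {
                String name = p.getName();
                return (name.startsWith("_") || name.startsWith(".")) && !name.contains("=");
              });
    }

    // hasHiddenParent(new Path("/table/data/_c2_trunc/subfolder/file.txt"))   -> true
    // hasHiddenParent(new Path("/table/data/_c2_trunc=AA/c3=AAAA/f.parquet")) -> false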

Comment on lines 321 to 318
// NOTE: check the path relative to the table location, to avoid checking
// unnecessary root folders
Path relativeFilePath = new Path(fileInfo.location().replace(location, ""));
Contributor Author

Creating a relative path to avoid checking the table's parent folders. However, this replace(location, "") might not be the best solution; open to any ideas.
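One hedged alternative sketch: strip the location only when it is a true prefix of the file location, so replace() cannot also rewrite an occurrence of the string elsewhere in the path (the empty-result edge case is ignored here):

    String fileLocation = fileInfo.location();
    // Only strip the table location from the beginning of the path.
    String relative =
        fileLocation.startsWith(location)
            ? fileLocation.substring(location.length())
            : fileLocation;
    Path relativeFilePath = new Path(relative);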

@danielcweeks
Contributor

@ismailsimsek my issue with this PR is the same as with the previous PR. This isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not.

I think we need a way to break up the prefixes appropriately so that we can distribute the listing.

@danielcweeks danielcweeks self-requested a review January 15, 2025 19:13
}

@VisibleForTesting
Dataset<String> listWithPrefix() {
Contributor

We need a way to break up the key space, possibly by taking hints from what LocationProvider is configured for the table. A single listing is not scalable.

Contributor Author

@ismailsimsek ismailsimsek Jan 17, 2025

@danielcweeks I'm having trouble figuring out how to get the sub-folders and list them separately using the existing classes.

Does it make sense to add a new method to the SupportsPrefixOperations interface?

Something like below; with the cloud client libraries, the delimiter (/) parameter could be used to implement it:

    Iterable<FileInfo> listPrefix(String prefix, boolean currentDirectory);

So the idea is to do it in two steps (sketched below):
1. First make a call to get the sub-directories (using the new method, delimiter = "/").
2. Then do a full listing per prefix + sub-directory.
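A hypothetical sketch of those two steps, assuming the proposed listPrefix(String, boolean) overload existed (it does not today) and given some io and dataLocation in scope:

    List<String> matchingFiles = new ArrayList<>();
    // Step 1 (proposed API): delimiter-style call returning only the
    // immediate sub-directory prefixes under the data location.
    for (FileInfo subDir : io.listPrefix(dataLocation, true)) {
      // Step 2: full recursive listing per sub-directory; each of these
      // listings could run as a separate task.
      for (FileInfo file : io.listPrefix(subDir.location(), false)) {
        matchingFiles.add(file.location());
      }
    }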

Contributor

We don't want to add a delimiter if at all possible. I think the right approach is to enumerate the key space of the first character (or the first few characters) and then distribute the key space for executors to process as tasks.

Depending on the layout strategy, this could be different, but it is generally predictable.

Contributor

For example, with the new object store layout, you would be able to use a binary representation for the first three characters (one task for each prefix from 000 to 111), then have separate tasks for anything above and below the binary characters, and distribute that processing.
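For instance, a small sketch of enumerating those three-character binary prefixes as listing tasks (the extra tasks for keys sorting before "000" or after "111" are left out):

    import java.util.ArrayList;
    import java.util.List;

    // One listing task per binary prefix, "000" through "111".
    List<String> taskPrefixes = new ArrayList<>();
    for (int i = 0; i < 8; i++) {
      String bits = Integer.toBinaryString(i);
      taskPrefixes.add("000".substring(bits.length()) + bits); // left-pad with '0'
    }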

@RussellSpitzer
Member

@ismailsimsek my issue with this PR is the same as with the previous PR. This isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not.

I think we need a way to break up the prefixes appropriately so that we can distribute the listing.

Do we have some technical docs on the performance of the listPrefix approach? I tried to look this up but couldn't find anything other than some old Stack Overflow posts saying it worked on 80 million entries in someone's workflow. I just want to make sure we aren't parallelizing something on the client that isn't already parallelized on the server.

I think @danielcweeks is right that the naive approach here could be very dangerous if the server implementation of list prefix was not internally distributed.

@ismailsimsek ismailsimsek force-pushed the fix-remove-orphan-file-action branch from b4f5ede to 5c9b49c on January 24, 2025 20:07
Comment on lines +339 to +359
if (table.locationProvider() instanceof LocationProviders.ObjectStoreLocationProvider) {
  // ObjectStoreLocationProvider generates hierarchical prefixes in a binary fashion
  // (0000/, 0001/, 0010/, 0011/, ...).
  // This allows us to parallelize listing operations across these prefixes.
  List<String> prefixes =
      List.of(
          "/0000", "/0001", "/0010", "/0011", "/0100", "/0101", "/0110", "/0111",
          "/1000", "/1001", "/1010", "/1011", "/1100", "/1101", "/1110", "/1111");

  String tableDataLocationRoot = table.locationProvider().dataLocationRoot();
  for (String prefix : prefixes) {
    List<String> result = listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter);
    matchingFiles.addAll(result);
  }
} else {
  matchingFiles.addAll(listLocationWithPrefix(location, pathFilter));
}

JavaRDD<String> matchingFileRDD = sparkContext().parallelize(matchingFiles, 1);
return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());
Contributor Author

@ismailsimsek ismailsimsek Jan 25, 2025

@danielcweeks @RussellSpitzer at a high level, does this reflect the idea of multiple listings? Do we need to use Spark instead of the loop to parallelize it? A sketch of that follows below.
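One way the driver-side loop could be pushed onto executors (a sketch only; it assumes listLocationWithPrefix and pathFilter are serializable or otherwise reachable from within a Spark task, which the current code may not guarantee):

    // Distribute one listing task per prefix instead of looping on the driver.
    JavaRDD<String> prefixRDD = sparkContext().parallelize(prefixes, prefixes.size());
    JavaRDD<String> matchingFileRDD =
        prefixRDD.flatMap(
            prefix ->
                listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter).iterator());
    return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());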

A few blocking limitations (please correct me if any of this is wrong):

  • listPrefix requires an exact prefix; a regexp cannot be used.
  • listPrefix requires an exact folder as the prefix; the first few characters of a folder name cannot be used as the listing prefix.
    • In both scenarios above, when the prefix does not exist, HadoopFileIO throws java.io.FileNotFoundException for the given prefix (I expect other FileIO implementations behave the same; see the sketch after this list).
  • listPrefix cannot be used to list prefixes sorting below xxxx/0000 or above xxxx/1111.
    • So only prefixes with these numeric hashes are listed; it's not possible to list other folders under the data location, since we cannot determine their names.
  • Had to extend LocationProvider to expose dataLocationRoot so that we can use it to list an exact prefix with the hashes.
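On the FileNotFoundException point, a hedged sketch of guarding each per-prefix call (assuming the missing prefix surfaces as an UncheckedIOException wrapping FileNotFoundException; other FileIO implementations may differ):

    import java.io.FileNotFoundException;
    import java.io.UncheckedIOException;

    try {
      matchingFiles.addAll(listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter));
    } catch (UncheckedIOException e) {
      if (!(e.getCause() instanceof FileNotFoundException)) {
        throw e;
      }
      // This binary prefix simply does not exist for the table yet; skip it.
    }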

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 25, 2025
@github-actions

github-actions bot commented Mar 4, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Mar 4, 2025