Use SupportsPrefixOperations for Remove OrphanFile Procedure #11906
Conversation
Force-pushed from 0846141 to 6267e48
Force-pushed from 6267e48 to a191684
cc @flyrain @RussellSpitzer @rahil-c: it's ready for review and a test has been added. I'd also appreciate any suggestions on the failing test.
The test here says it's failing because you are deleting
.isTrue();

DeleteOrphanFiles.Result result3 =
DeleteOrphanFilesSparkAction action3 =
While we are modifying things here, please rename all the name# variables to something relevant to the test; the names should reflect what we are checking.
Renamed to more descriptive names.
Force-pushed from bdd982c to 2212f2a
public boolean hasHiddenPttParentFolder(Path path) {
  return Stream.iterate(path, Path::getParent)
      .takeWhile(Objects::nonNull)
      .anyMatch(parentPath -> !doAccept(parentPath));
}
Now it checks the parent folders of each file, to ensure that none of them is a hidden partition folder. This might be less performant for large listings, if performance is a concern.
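A minimal standalone sketch of the parent walk above, assuming a toy doAccept that treats folder names starting with "." or "_" as hidden (the real rules live in the PR's filter class):

import java.util.Objects;
import java.util.stream.Stream;
import org.apache.hadoop.fs.Path;

public class ParentWalkSketch {
  // Toy stand-in for the PR's doAccept: reject hidden folder names.
  static boolean doAccept(Path path) {
    String name = path.getName();
    return !name.startsWith(".") && !name.startsWith("_");
  }

  static boolean hasHiddenParentFolder(Path path) {
    // Walk path, path.getParent(), ... up to the root (getParent() == null)
    // and flag the file if any ancestor is rejected by the filter.
    return Stream.iterate(path, Path::getParent)
        .takeWhile(Objects::nonNull)
        .anyMatch(parent -> !doAccept(parent));
  }

  public static void main(String[] args) {
    // false: no hidden ancestor
    System.out.println(hasHiddenParentFolder(new Path("/table/data/a=1/file.parquet")));
    // true: "_tmp" is treated as hidden
    System.out.println(hasHiddenParentFolder(new Path("/table/data/_tmp/file.parquet")));
  }
}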
// NOTE: check the path relative to the table location, to avoid checking unnecessary root folders
Path relativeFilePath = new Path(fileInfo.location().replace(location, ""));
Creating a relative path to avoid checking the table's parent folders. However, this replace(location, "") might not be the best solution; open to any ideas.
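One possible alternative, sketched here only as an idea: relativize with an explicit prefix check instead of String#replace, which could also match mid-string. The variable names mirror the snippet above.

// Hedged sketch: strip the table location only when it is a true prefix,
// avoiding the accidental mid-string replacement String#replace allows.
String fileLocation = fileInfo.location();
String relative =
    fileLocation.startsWith(location)
        ? fileLocation.substring(location.length())
        : fileLocation;
Path relativeFilePath = new Path(relative);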
@ismailsimsek my issue with this PR is the same as with the previous PR: this isn't a scalable solution. The file system approach was able to parallelize the work through directory traversal, but this does not. I think we need a way to break up the prefixes appropriately so that we can distribute the listing.
}

@VisibleForTesting
Dataset<String> listWithPrefix() {
We need a way to break up the key space, possibly by taking hints from which LocationProvider is configured for the table. A single listing is not scalable.
@danielcweeks I'm having trouble figuring out how to get sub-folders and list them separately using the existing classes. Does it make sense to add a new method to the SupportsPrefixOperations interface? Something like:

Iterable<FileInfo> listPrefix(String prefix, boolean currentDirectory);

With cloud libraries, the delimiter (/) parameter could be used for this. So the idea is doing it in 2 steps (sketched below):
1. First make a call to get the sub-directories (using the new method, delimiter="/").
2. Then do a full listing per prefix + sub-directory.
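A hedged sketch of that two-step flow. The two-argument listPrefix overload is the hypothetical addition proposed above (it does not exist on SupportsPrefixOperations today), and io stands for a FileIO implementing it:

import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.io.FileInfo;

List<String> allFiles = new ArrayList<>();
// Step 1: shallow listing, only the immediate "directories" under the data
// location (currentDirectory=true would map to delimiter="/" in cloud SDKs).
for (FileInfo dir : io.listPrefix(dataLocation, true)) {
  // Step 2: full recursive listing per discovered sub-directory.
  for (FileInfo file : io.listPrefix(dir.location(), false)) {
    allFiles.add(file.location());
  }
}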
We don't want to add a delimiter if at all possible. I think the right approach is to enumerate the key space of the first character (or the first few characters) and then distribute the key space for executors to process as tasks.

Depending on the layout strategy, this could be different, but it is generally predictable.
For example, with the new object store layout, you would be able to use a binary representation for the first three characters (one task for each prefix 000 to 111), then have separate tasks for anything above and below the binary characters, and then distribute that processing.
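A small illustration of the enumeration being described; this only builds the per-prefix task list, with the two boundary tasks noted in comments:

import java.util.ArrayList;
import java.util.List;

// One listing task per 3-character binary prefix (000 .. 111), as suggested
// above; two extra tasks would cover keys sorting below "000" and above "111".
List<String> prefixes = new ArrayList<>();
for (int i = 0; i < 8; i++) {
  prefixes.add(String.format("%3s", Integer.toBinaryString(i)).replace(' ', '0'));
}
// prefixes = [000, 001, 010, 011, 100, 101, 110, 111]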
Do we have some technical docs on the performance of the listPrefix approach? I tried to look this up but couldn't find anything other than some old Stack Overflow posts saying it worked on 80 million entries in someone's workflow. I just want to make sure we aren't parallelizing something on the client that isn't already parallelized on the server. I think @danielcweeks is right that the naive approach here could be very dangerous if the server implementation of list prefix is not internally distributed.
Commit: "… 3.5, improve naming" (Co-authored-by: Rahil Chertara <[email protected]>)
Force-pushed from b4f5ede to 5c9b49c
if (table.locationProvider() instanceof LocationProviders.ObjectStoreLocationProvider) {
  // ObjectStoreLocationProvider generates hierarchical prefixes in a binary fashion
  // (0000/, 0001/, 0010/, 0011/, ...).
  // This allows us to parallelize listing operations across these prefixes.
  List<String> prefixes =
      List.of(
          "/0000", "/0001", "/0010", "/0011", "/0100", "/0101", "/0110", "/0111", "/1000",
          "/1001", "/1010", "/1011", "/1100", "/1101", "/1110", "/1111");

  String tableDataLocationRoot = table.locationProvider().dataLocationRoot();
  for (String prefix : prefixes) {
    List<String> result = listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter);
    matchingFiles.addAll(result);
  }
} else {
  matchingFiles.addAll(listLocationWithPrefix(location, pathFilter));
}

JavaRDD<String> matchingFileRDD = sparkContext().parallelize(matchingFiles, 1);
return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());
@danielcweeks @RussellSpitzer at a high level, does this reflect the idea of multiple listings? Do we need to use Spark instead of the loop to parallelize it (see the sketch after this list)?
A few blocking limitations are (please correct me if this is wrong):
- listPrefix requires an exact prefix; it cannot use a regexp.
- listPrefix requires an exact folder as the prefix; it cannot use the first few characters of a folder name as the listing prefix.
- In both scenarios above, when the prefix does not exist, HadoopFileIO throws java.io.FileNotFoundException for the given prefix (I expect other FileIO implementations behave the same).
- listPrefix cannot be used to list prefixes below xxxx/0000 or above xxxx/1111, so only prefixes with these numeric hashes are listed. It is not possible to list other folders under the data location, since we cannot determine the folder names.
- Had to extend LocationProvider to expose dataLocationRoot so that we can use it to list exact prefixes with hashes.
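A hedged sketch of distributing that loop over executors instead of listing on the driver; it reuses the names from the snippet above (prefixes, tableDataLocationRoot, pathFilter) and assumes listLocationWithPrefix is serializable enough to run inside a task:

// Distribute one listing task per prefix rather than looping on the driver.
JavaRDD<String> prefixRDD = sparkContext().parallelize(prefixes, prefixes.size());
JavaRDD<String> matchingFileRDD =
    prefixRDD.flatMap(
        prefix ->
            listLocationWithPrefix(tableDataLocationRoot + prefix, pathFilter).iterator());
return spark().createDataset(matchingFileRDD.rdd(), Encoders.STRING());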
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
Continuing #7914

Key pieces: listWithPrefix, PartitionAwareHiddenPathFilter, testHiddenPathsStartingWithPartitionNamesAreIgnored.

With this change, current executions now use HadoopFileIO, which implements DelegateFileIO and SupportsPrefixOperations. This results in calls to the new listWithPrefix method.
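A minimal sketch of the dispatch this description refers to, assuming table.io() returns the configured FileIO and that the non-prefix branch falls back to the action's existing recursive listing:

import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.SupportsPrefixOperations;

FileIO io = table.io();
List<String> matchingFiles = new ArrayList<>();
if (io instanceof SupportsPrefixOperations) {
  // New path: one flat prefix listing instead of recursive directory traversal.
  ((SupportsPrefixOperations) io)
      .listPrefix(location)
      .forEach(fileInfo -> matchingFiles.add(fileInfo.location()));
} else {
  // Fall back to the pre-existing Hadoop-based recursive directory listing.
}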