-
Notifications
You must be signed in to change notification settings - Fork 338
fix(reader): Support both position and equality delete files on the same FileScanTask #1778
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| schema, | ||
| ) | ||
| .await?, | ||
| batch_stream: basic_delete_file_loader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrect, according to iceberg spec, we must do schema evolution. I think the correct approach is to fix arrow's schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, thank you for the reference! I'm still learning my way around the Iceberg spec so I appreciate the check. I am addressing your comments in #1777 and it will result in a new schema being passed to ArrowReaderOptions so let me get that sorted first, and then maybe I can adapt the changes here to the modified schema being passed into ArrowReaderOptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I accidentally brought unintended changes from #1782 here. I will address your comment in that PR. Thanks again @liurenjie1024!
What issue does this PR close?
Partially address #1749.
Rationale for this change
This PR fixes a bug in delete file loading when a
FileScanTaskcontains both positional and equality delete files. We hit this when running Iceberg Java test suite via Comet in apache/datafusion-comet#2528. The tests that failed wereThe Bug:
The condition in
try_start_eq_del_load(delete_filter.rs:71-73) was inverted. It returnedNonewhen the equality delete file was not in the cache, causing the loader to skip loading it. Whenbuild_equality_delete_predicatewas later called, it would fail with "Missing predicate for equality delete file".What changes are included in this PR?
The Fix:
Nonewhen the file is already in the cache (being loaded or loaded), preventing duplicate work across concurrent tasksAdditional Changes:
test_load_deletes_with_mixed_typesthat reproduces the bug scenarioAre these changes tested?
Yes, this PR includes a new unit test
test_load_deletes_with_mixed_typesthat:FileScanTaskwith both a positional delete file and an equality delete fileload_deletessuccessfully processes both typesbuild_equality_delete_predicatesucceeds without the "Missing predicate" errorThe test would fail before this fix and passes after.