@mbutrovich mbutrovich commented Oct 22, 2025

What issue does this PR close?

Partially addresses #1749.

Rationale for this change

This PR fixes a bug in delete file loading when a FileScanTask contains both positional and equality delete files. We hit this when running the Iceberg Java test suite via Comet in apache/datafusion-comet#2528. The tests that failed were:

TestSparkExecutorCache > testMergeOnReadUpdate()
TestSparkExecutorCache > testMergeOnReadMerge()
TestSparkExecutorCache > testMergeOnReadDelete()

The Bug:
The condition in try_start_eq_del_load (delete_filter.rs:71-73) was inverted. It returned None when the equality delete file was not in the cache, causing the loader to skip loading it. When build_equality_delete_predicate was later called, it would fail with "Missing predicate for equality delete file".

What changes are included in this PR?

The Fix:

  • Inverted the condition so it returns None when the file is already in the cache (being loaded or loaded), preventing duplicate work across concurrent tasks
  • When the file is not in the cache, mark it as Loading and proceed with loading
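
To illustrate the corrected check, here is a minimal, self-contained sketch. It is not the actual code in delete_filter.rs: `DeleteFilterSketch`, `EqDelState`, `finish_eq_del_load`, the `Option<()>` return type, and the use of a `String` as a stand-in for the predicate are all assumptions made for illustration. Only the function names `try_start_eq_del_load` and `build_equality_delete_predicate` and the "Missing predicate" message come from this PR.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cache states for one equality delete file.
#[derive(Debug, Clone, PartialEq)]
enum EqDelState {
    Loading,
    /// A `String` stands in for the real predicate type.
    Loaded(String),
}

/// Hypothetical stand-in for the shared delete filter state.
#[derive(Default, Clone)]
struct DeleteFilterSketch {
    eq_deletes: Arc<Mutex<HashMap<String, EqDelState>>>,
}

impl DeleteFilterSketch {
    /// Returns `Some(())` when the caller should load the file, and `None`
    /// when the file is already in the cache (being loaded or loaded).
    /// The bug described above had these two outcomes swapped.
    fn try_start_eq_del_load(&self, path: &str) -> Option<()> {
        let mut cache = self.eq_deletes.lock().unwrap();
        if cache.contains_key(path) {
            // Another task owns this file; skip to avoid duplicate work.
            return None;
        }
        // Not in the cache yet: mark it as Loading and let the caller proceed.
        cache.insert(path.to_string(), EqDelState::Loading);
        Some(())
    }

    /// Records the loaded predicate for the file (hypothetical helper).
    fn finish_eq_del_load(&self, path: &str, predicate: String) {
        let mut cache = self.eq_deletes.lock().unwrap();
        cache.insert(path.to_string(), EqDelState::Loaded(predicate));
    }

    /// Fails with the "Missing predicate" error when the file was never loaded.
    fn build_equality_delete_predicate(&self, path: &str) -> Result<String, String> {
        let cache = self.eq_deletes.lock().unwrap();
        match cache.get(path) {
            Some(EqDelState::Loaded(p)) => Ok(p.clone()),
            Some(EqDelState::Loading) => {
                Err(format!("equality delete file {path} is still loading"))
            }
            None => Err(format!("Missing predicate for equality delete file {path}")),
        }
    }
}
```

In this sketch, the first call to `try_start_eq_del_load` for a path returns `Some(())` and any later call for the same path returns `None`. With the condition inverted, the first call would return `None`, nothing would ever be inserted, and `build_equality_delete_predicate` would hit the `None` arm.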

Additional Changes:

  • Added test case test_load_deletes_with_mixed_types that reproduces the bug scenario

Are these changes tested?

Yes, this PR includes a new unit test test_load_deletes_with_mixed_types that:

  • Creates a FileScanTask with both a positional delete file and an equality delete file
  • Verifies that load_deletes successfully processes both types
  • Verifies that build_equality_delete_predicate succeeds without the "Missing predicate" error

I also confirmed that this fixes the failing tests in Iceberg Java's suite, which we originally hit via Comet in apache/datafusion-comet#2528.

The test would fail before this fix and passes after.

@mbutrovich mbutrovich changed the title fix(reader): Support both position and equality delete on the same FileScanTask fix(reader): Support both position and equality delete files on the same FileScanTask Oct 22, 2025
schema,
)
.await?,
batch_stream: basic_delete_file_loader
Contributor

This is incorrect: according to the Iceberg spec, we must do schema evolution. I think the correct approach is to fix Arrow's schema?

Contributor Author

Sounds good, thank you for the reference! I'm still learning my way around the Iceberg spec, so I appreciate the check. I am addressing your comments in #1777, which will result in a new schema being passed to ArrowReaderOptions. Let me get that sorted first, and then maybe I can adapt the changes here to that modified schema.

Contributor Author

So I accidentally brought unintended changes from #1782 here. I will address your comment in that PR. Thanks again @liurenjie1024!

@mbutrovich mbutrovich marked this pull request as ready for review October 29, 2025 19:49