Conversation

@mbutrovich (Contributor) commented Oct 22, 2025

What issue does this PR close?

Partially addresses #1749.

Rationale for this change

Background: This issue was discovered when running Iceberg Java's test suite against our experimental DataFusion Comet branch that uses iceberg-rust. Many failures occurred in TestMigrateTableAction.java, which tests reading Parquet files from migrated tables (e.g., from Hive or Spark) that lack embedded field ID metadata.

Problem: The Rust ArrowReader was unable to read these files. Iceberg Java handles them with a position-based fallback in which top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including their nested content) are projected.

What changes are included in this PR?

This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.

Solution: Implemented fallback projection in ArrowReader::get_arrow_projection_mask_fallback(), matching Java's ParquetSchemaUtil.pruneColumnsFallback() behavior (see the sketch after this list):

  • Detects Parquet files without field IDs by checking Arrow schema metadata
  • Maps top-level field IDs to top-level column positions (field IDs are 1-indexed, positions are 0-indexed)
  • Uses ProjectionMask::roots() to project entire columns including nested content (structs, lists, maps)
  • Adds field ID metadata to the projected schema for RecordBatchTransformer
  • Supports schema evolution by allowing missing columns (filled with default values by RecordBatchTransformer)
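To make the mapping concrete, here is a minimal sketch of the position-based fallback using the parquet crate. The helper name fallback_projection is illustrative; the real logic lives in ArrowReader::get_arrow_projection_mask_fallback() and this is not the exact code in this PR.

```rust
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Map 1-indexed top-level field IDs to 0-indexed top-level Parquet column
/// positions and project entire root columns, nested content included.
fn fallback_projection(
    parquet_schema: &SchemaDescriptor,
    projected_field_ids: &[i32],
) -> ProjectionMask {
    let num_roots = parquet_schema.root_schema().get_fields().len();
    let roots = projected_field_ids.iter().filter_map(|&id| {
        // Field ID N corresponds to root column position N - 1.
        let pos = usize::try_from(id).ok()?.checked_sub(1)?;
        // IDs past the file's columns are skipped (schema evolution); the
        // RecordBatchTransformer later fills those columns with defaults.
        (pos < num_roots).then_some(pos)
    });
    ProjectionMask::roots(parquet_schema, roots)
}
```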

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows.

Are these changes tested?

Yes, comprehensive unit tests were added to verify the fallback path works correctly:

  • test_read_parquet_file_without_field_ids - Basic projection with primitive columns using position-based mapping
  • test_read_parquet_without_field_ids_partial_projection - Project subset of columns
  • test_read_parquet_without_field_ids_schema_evolution - Handle missing columns with NULL values
  • test_read_parquet_without_field_ids_multiple_row_groups - Verify behavior across row group boundaries
  • test_read_parquet_without_field_ids_with_struct - Project structs with nested fields (entire top-level column)
  • test_read_parquet_without_field_ids_filter_eliminates_all_rows - Comet hit a panic when all row groups were filtered out; this test reproduces that scenario
  • test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle - Schema evolution that adds a column in the middle of the schema, which previously caused a panic

All tests verify that behavior matches Iceberg Java's pruneColumnsFallback() implementation in
/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java.
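For a sense of the setup these tests share, here is a rough sketch of writing a Parquet file with no embedded field IDs, the kind of file a migrated table produces. All names are illustrative; this is not the actual test code.

```rust
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn write_file_without_field_ids(path: &str) -> parquet::errors::Result<()> {
    // No "PARQUET:field_id" metadata on any field, so the resulting Parquet
    // file lacks embedded field IDs, like a table migrated from Hive/Spark.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(Int32Array::from(vec![4, 5, 6])),
        ],
    )?;
    let mut writer = ArrowWriter::try_new(std::fs::File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```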

mbutrovich added a commit to mbutrovich/datafusion-comet that referenced this pull request Oct 22, 2025
@mbutrovich mbutrovich marked this pull request as draft October 22, 2025 10:28
@mbutrovich mbutrovich marked this pull request as ready for review October 22, 2025 18:09
@mbutrovich (Contributor, Author) commented:
CI failure looks like an environment issue; it should be fine on a rerun.

@liurenjie1024 (Contributor) left a comment:

Thanks @mbutrovich for this fix. While this PR is generally correct, it handles the special case in several places. I have concerns about the extensibility of this approach; for example, what if we want to handle name mapping next? Would it be possible to refactor along the following lines:

  1. If Parquet field IDs do not exist, build a new Parquet schema with assigned field IDs, as Java does: https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L96
  2. Create the Arrow schema from the generated Parquet schema.

With this approach, we could keep the other parts unchanged. WDYT?
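One way to realize this suggestion in Rust is to stamp positional field IDs into the converted Arrow schema via the "PARQUET:field_id" metadata key that the parquet crate uses to round-trip field IDs. A minimal sketch, assuming IDs are assigned by position (top-level position N gets field ID N+1); names are illustrative, not this PR's actual code.

```rust
use std::sync::Arc;

use arrow_schema::{Field, Schema};

const PARQUET_FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Return a copy of `schema` whose top-level fields carry positional field
/// IDs, mirroring the fallback-ID assignment the reviewer links above.
fn with_fallback_field_ids(schema: &Schema) -> Schema {
    let fields: Vec<Arc<Field>> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(pos, field)| {
            let mut metadata = field.metadata().clone();
            // Field IDs are 1-indexed, positions are 0-indexed.
            metadata.insert(PARQUET_FIELD_ID_KEY.to_string(), (pos + 1).to_string());
            Arc::new(field.as_ref().clone().with_metadata(metadata))
        })
        .collect();
    Schema::new_with_metadata(fields, schema.metadata().clone())
}
```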

@mbutrovich (Contributor, Author) commented:

Thanks for the comments @liurenjie1024! Let me take a look today to address your feedback.

@mbutrovich (Contributor, Author) commented:

Thanks again @liurenjie1024! Please let me know if these changes reflect what you had in mind. I'm hoping this design, where we pass a custom schema to ArrowReaderOptions, will be helpful for future schema transformations such as #1778.
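For reference, a minimal sketch of supplying a custom schema through the parquet crate's ArrowReaderOptions; schema_with_ids is assumed to be the Arrow schema carrying the injected field IDs, and this is not the exact code in this PR.

```rust
use std::sync::Arc;

use arrow_schema::Schema;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn read_with_custom_schema(
    file: std::fs::File,
    schema_with_ids: Arc<Schema>,
) -> parquet::errors::Result<()> {
    // with_schema asks the reader to decode into the supplied Arrow schema
    // (it must be compatible with the file), so downstream code sees the
    // injected field-ID metadata instead of the bare file schema.
    let options = ArrowReaderOptions::new().with_schema(schema_with_ids);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```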
