feat(reader): handle field ID conflicts in RecordBatchTransformer by mbutrovich · Pull Request #1804 · apache/iceberg-rust

mbutrovich · 2025-10-30T00:21:24Z

Which issue does this PR close?

Partially addresses #1749 (ArrowReader enhancements for Apache DataFusion Comet). I'm copying my comments from the test:

    /// This reproduces the scenario from Iceberg Java's TestAddFilesProcedure where:
    /// - Hive-style partitioned Parquet files are imported via add_files procedure
    /// - Parquet files have field IDs: name (1), subdept (2)
    /// - Iceberg schema assigns different field IDs: id (1), name (2), dept (3), subdept (4)
    /// - Partition columns (id, dept) have initial_default values from manifests
    ///
    /// Without proper handling, this would incorrectly:
    /// 1. Try to read partition column "id" (field_id=1) from Parquet field_id=1 ("name")
    /// 2. Read data column "name" (field_id=2) from Parquet field_id=2 ("subdept")
    ///
    /// The fix ensures:
    /// 1. Partition columns with initial_default are ALWAYS read as constants (never from Parquet)
    /// 2. Data columns use name-based mapping when field ID conflicts are detected
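A minimal sketch of the two rules above, using the field IDs from the test scenario. This is not the RecordBatchTransformer code from the PR: `ColumnSource`, `IcebergField`, `ParquetField`, `has_field_id_conflict`, and `plan_sources` are hypothetical names chosen for illustration. The constant values are passed in as plain strings, whereas the real reader would take them from partition metadata.

```rust
use std::collections::HashMap;

/// Where a projected Iceberg column gets its values (hypothetical type).
#[derive(Debug, PartialEq)]
enum ColumnSource {
    /// Constant value (e.g. a partition value); never read from the file.
    Constant(String),
    /// Read the Parquet column at this position in the file schema.
    FileColumn(usize),
    /// Column is absent from the file.
    Missing,
}

struct IcebergField {
    id: i32,
    name: &'static str,
}

struct ParquetField {
    id: i32,
    name: &'static str,
}

/// A conflict exists when a field ID present in both schemas
/// refers to differently named columns.
fn has_field_id_conflict(schema: &[IcebergField], file: &[ParquetField]) -> bool {
    let file_names_by_id: HashMap<i32, &str> = file.iter().map(|f| (f.id, f.name)).collect();
    schema.iter().any(|f| {
        file_names_by_id
            .get(&f.id)
            .map_or(false, |file_name| *file_name != f.name)
    })
}

/// Decide the source of every projected column:
/// 1. columns with a constant are always constants;
/// 2. other columns are matched by name if a conflict was detected,
///    and by field ID otherwise.
fn plan_sources(
    schema: &[IcebergField],
    file: &[ParquetField],
    constants: &HashMap<i32, String>,
) -> Vec<ColumnSource> {
    let match_by_name = has_field_id_conflict(schema, file);
    schema
        .iter()
        .map(|field| {
            if let Some(value) = constants.get(&field.id) {
                return ColumnSource::Constant(value.clone());
            }
            let position = file.iter().position(|p| {
                if match_by_name {
                    p.name == field.name
                } else {
                    p.id == field.id
                }
            });
            match position {
                Some(index) => ColumnSource::FileColumn(index),
                None => ColumnSource::Missing,
            }
        })
        .collect()
}
```

With the schema id (1), name (2), dept (3), subdept (4), a file containing name (1), subdept (2), and constants for id and dept, this plan reads `name` and `subdept` from file positions 0 and 1 and never touches the file for `id` or `dept`.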

What changes are included in this PR?

Detect conflicts in field ID mappings and resolve them in a way similar to Iceberg Java (BaseParquetReaders.java, PartitionUtil.constantsMap()).

Are these changes tested?

New test: add_files_partition_columns_with_field_id_conflict.
This fixed 42 tests in Iceberg Java's spark-extensions TestAddFilesProcedure suite when running with Comet's PR "feat: [iceberg] Native scan by serializing FileScanTasks to iceberg-rust" (datafusion-comet#2528).

mbutrovich · 2025-10-30T01:37:22Z

Marking as draft while I review some new Iceberg Java failures this created for me.

mbutrovich · 2025-10-30T13:31:40Z

I think I'll close this in favor of a more comprehensive fix that handles partition specs correctly.

mbutrovich added 2 commits October 29, 2025 20:15

Fix field ID conflicts in RecordBatchTransformer.

03c9f4e

Disambiguate partition values.

d85d675

mbutrovich marked this pull request as draft October 30, 2025 01:37

mbutrovich closed this Oct 30, 2025

mbutrovich deleted the field_id_conflicts branch November 3, 2025 18:56

mbutrovich wants to merge 2 commits into apache:main from mbutrovich:field_id_conflicts
