@mbutrovich
Contributor

Which issue does this PR close?

    /// This reproduces the scenario from Iceberg Java's TestAddFilesProcedure where:
    /// - Hive-style partitioned Parquet files are imported via add_files procedure
    /// - Parquet files have field IDs: name (1), subdept (2)
    /// - Iceberg schema assigns different field IDs: id (1), name (2), dept (3), subdept (4)
    /// - Partition columns (id, dept) have initial_default values from manifests
    ///
    /// Without proper handling, this would incorrectly:
    /// 1. Try to read partition column "id" (field_id=1) from Parquet field_id=1 ("name")
    /// 2. Read data column "name" (field_id=2) from Parquet field_id=2 ("subdept")
    ///
    /// The fix ensures:
    /// 1. Partition columns with initial_default are ALWAYS read as constants (never from Parquet)
    /// 2. Data columns use name-based mapping when field ID conflicts are detected

What changes are included in this PR?

  • Detect conflicts in field ID mappings and resolve them similarly to Iceberg Java (BaseParquetReaders.java, PartitionUtil.constantsMap())
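
For illustration, here is a minimal sketch of the resolution order described in the doc comment above: constants first, then name-based mapping once a field ID conflict is detected. It is not the code from this PR. `Field`, `ColumnSource`, `has_field_id_conflict`, and `resolve_columns` are hypothetical names, constants are simplified to strings, and the conflict check (same field ID, different name) is an assumed heuristic that would also fire on a legitimately renamed column.

```rust
use std::collections::HashMap;

/// A schema field reduced to what matters here (hypothetical type).
#[derive(Debug, Clone)]
struct Field {
    id: i32,
    name: String,
}

/// Convenience constructor for the sketch.
fn field(id: i32, name: &str) -> Field {
    Field { id, name: name.to_string() }
}

/// Where a projected Iceberg column gets its values (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum ColumnSource {
    /// Constant from partition metadata; the Parquet file is never consulted.
    Constant(String),
    /// Index of the column in the Parquet file schema.
    ParquetColumn(usize),
    /// Neither a constant nor a file column is available.
    Missing,
}

/// Assumed heuristic: a conflict exists when a Parquet field ID matches an
/// Iceberg field ID but the two fields carry different names.
fn has_field_id_conflict(iceberg: &[Field], parquet: &[Field]) -> bool {
    parquet
        .iter()
        .any(|p| iceberg.iter().any(|i| i.id == p.id && i.name != p.name))
}

/// Resolve every Iceberg column, in schema order.
/// `constants` maps an Iceberg field ID to its constant value.
fn resolve_columns(
    iceberg: &[Field],
    parquet: &[Field],
    constants: &HashMap<i32, String>,
) -> Vec<ColumnSource> {
    let match_by_name = has_field_id_conflict(iceberg, parquet);
    iceberg
        .iter()
        .map(|col| {
            // 1. Columns with a constant are always served from it.
            if let Some(value) = constants.get(&col.id) {
                return ColumnSource::Constant(value.clone());
            }
            // 2. Data columns: match by name on conflict, by field ID otherwise.
            let position = parquet.iter().position(|p| {
                if match_by_name {
                    p.name == col.name
                } else {
                    p.id == col.id
                }
            });
            match position {
                Some(index) => ColumnSource::ParquetColumn(index),
                None => ColumnSource::Missing,
            }
        })
        .collect()
}
```

With the schemas from the scenario above (Parquet: name (1), subdept (2); Iceberg: id (1), name (2), dept (3), subdept (4)) and constants supplied for id and dept, this sketch resolves "name" and "subdept" to Parquet columns 0 and 1 instead of following the colliding field IDs.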

Are these changes tested?

@mbutrovich mbutrovich marked this pull request as draft October 30, 2025 01:37
@mbutrovich
Contributor Author

Draft while I review some new Iceberg Java failures this created for me.

@mbutrovich
Contributor Author

I think I'll close this in favor of a more comprehensive fix that handles partition specs correctly.

@mbutrovich mbutrovich closed this Oct 30, 2025