Conversation

@mbutrovich (Contributor) commented Oct 22, 2025

What issue does this PR close?

Partially addresses #1749.

Rationale for this change

Background: This issue was discovered when running Iceberg Java's test suite against our experimental DataFusion Comet branch that uses iceberg-rust. Many failures occurred in TestMigrateTableAction.java, which tests reading Parquet files from migrated tables (e.g., from Hive or Spark) that lack embedded field ID metadata.

Problem: The Rust ArrowReader was unable to read these files. Iceberg Java handles them with a position-based fallback in which top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including their nested content) are projected.

What changes are included in this PR?

This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.

Solution: Implemented fallback projection in ArrowReader::get_arrow_projection_mask_fallback(), matching Java's ParquetSchemaUtil.pruneColumnsFallback() behavior (see the sketch after this list):

  • Detects Parquet files without field IDs by checking Arrow schema metadata
  • Maps top-level field IDs to top-level column positions (field IDs are 1-indexed, positions are 0-indexed)
  • Uses ProjectionMask::roots() to project entire columns including nested content (structs, lists, maps)
  • Adds field ID metadata to the projected schema for RecordBatchTransformer
  • Supports schema evolution by allowing missing columns (filled with default values by RecordBatchTransformer)
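To make the mapping concrete, here is a minimal sketch of the position-based fallback using the parquet crate. The helper name fallback_projection is illustrative; the real logic lives in ArrowReader::get_arrow_projection_mask_fallback() and this is not the exact code in this PR.

```rust
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Map 1-indexed top-level field IDs to 0-indexed top-level Parquet column
/// positions and project entire root columns, nested content included.
fn fallback_projection(
    parquet_schema: &SchemaDescriptor,
    projected_field_ids: &[i32],
) -> ProjectionMask {
    let num_roots = parquet_schema.root_schema().get_fields().len();
    let roots = projected_field_ids.iter().filter_map(|&id| {
        // Field ID N corresponds to root column position N - 1.
        let pos = usize::try_from(id).ok()?.checked_sub(1)?;
        // IDs past the file's columns are skipped (schema evolution); the
        // RecordBatchTransformer later fills those columns with defaults.
        (pos < num_roots).then_some(pos)
    });
    ProjectionMask::roots(parquet_schema, roots)
}
```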

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows.

Are these changes tested?

Yes, comprehensive unit tests were added to verify the fallback path works correctly:

  • test_read_parquet_file_without_field_ids - Basic projection with primitive columns using position-based mapping
  • test_read_parquet_without_field_ids_partial_projection - Project subset of columns
  • test_read_parquet_without_field_ids_schema_evolution - Handle missing columns with NULL values
  • test_read_parquet_without_field_ids_multiple_row_groups - Verify behavior across row group boundaries
  • test_read_parquet_without_field_ids_with_struct - Project structs with nested fields (entire top-level column)
  • test_read_parquet_without_field_ids_filter_eliminates_all_rows - Comet hit a panic when all row groups were filtered out; this test reproduces that scenario
  • test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle - Schema evolution that adds a column in the middle of the schema, which previously caused a panic

All tests verify that behavior matches Iceberg Java's pruneColumnsFallback() implementation in
/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java.
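For a sense of the setup these tests share, here is a rough sketch of writing a Parquet file with no embedded field IDs, the kind of file a migrated table produces. All names are illustrative; this is not the actual test code.

```rust
use std::sync::Arc;

use arrow_array::{Int32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn write_file_without_field_ids(path: &str) -> parquet::errors::Result<()> {
    // No "PARQUET:field_id" metadata on any field, so the resulting Parquet
    // file lacks embedded field IDs, like a table migrated from Hive/Spark.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int32, false),
        Field::new("b", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(Int32Array::from(vec![4, 5, 6])),
        ],
    )?;
    let mut writer = ArrowWriter::try_new(std::fs::File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```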

mbutrovich added a commit to mbutrovich/datafusion-comet that referenced this pull request Oct 22, 2025
@mbutrovich mbutrovich marked this pull request as draft October 22, 2025 10:28
@mbutrovich mbutrovich marked this pull request as ready for review October 22, 2025 18:09
@mbutrovich (Contributor, Author) commented:
CI failure looks like an environment issue; it should be fine on a rerun.

@liurenjie1024 (Contributor) left a comment:

Thanks @mbutrovich for this fix. While this PR is generally correct, it handles the special case in several places. I have concerns about the extensibility of this approach; for example, what if we want to handle name mapping next? Would it be possible to refactor along the following lines:

  1. If Parquet field IDs do not exist, build a new Parquet schema with assigned field IDs, as Java does: https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L96
  2. Create the Arrow schema from the generated Parquet schema.

With this approach, we could keep the other parts unchanged. WDYT?
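One way to realize this suggestion in Rust is to stamp positional field IDs into the converted Arrow schema via the "PARQUET:field_id" metadata key that the parquet crate uses to round-trip field IDs. A minimal sketch, assuming IDs are assigned by position (top-level position N gets field ID N+1); names are illustrative, not this PR's actual code.

```rust
use std::sync::Arc;

use arrow_schema::{Field, Schema};

const PARQUET_FIELD_ID_KEY: &str = "PARQUET:field_id";

/// Return a copy of `schema` whose top-level fields carry positional field
/// IDs, mirroring the fallback-ID assignment the reviewer links above.
fn with_fallback_field_ids(schema: &Schema) -> Schema {
    let fields: Vec<Arc<Field>> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(pos, field)| {
            let mut metadata = field.metadata().clone();
            // Field IDs are 1-indexed, positions are 0-indexed.
            metadata.insert(PARQUET_FIELD_ID_KEY.to_string(), (pos + 1).to_string());
            Arc::new(field.as_ref().clone().with_metadata(metadata))
        })
        .collect();
    Schema::new_with_metadata(fields, schema.metadata().clone())
}
```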

@mbutrovich (Contributor, Author) commented:

Thanks for the comments @liurenjie1024! Let me take a look today to address your feedback.

@mbutrovich (Contributor, Author) commented:

Thanks again @liurenjie1024! Please let me know if these changes reflect what you had in mind. I'm hoping this design, where we pass a custom schema to ArrowReaderOptions, will be helpful for future schema transformations such as #1778.
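For reference, a minimal sketch of supplying a custom schema through the parquet crate's ArrowReaderOptions; schema_with_ids is assumed to be the Arrow schema carrying the injected field IDs, and this is not the exact code in this PR.

```rust
use std::sync::Arc;

use arrow_schema::Schema;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn read_with_custom_schema(
    file: std::fs::File,
    schema_with_ids: Arc<Schema>,
) -> parquet::errors::Result<()> {
    // with_schema asks the reader to decode into the supplied Arrow schema
    // (it must be compatible with the file), so downstream code sees the
    // injected field-ID metadata instead of the bare file schema.
    let options = ArrowReaderOptions::new().with_schema(schema_with_ids);
    let reader = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```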
