feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) #1777
Conversation
CI failure looks like an environment issue; it should be fine on a rerun.
Thanks @mbutrovich for this fix. While this PR is generally correct, it handles the special case in several places, and I have concerns about the extensibility of this approach; for example, what if we later want to handle name mapping? Is it possible to refactor with the following method:
- If Parquet field IDs do not exist, build a new Parquet schema with assigned field IDs, as Java does: https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L96
- Create the Arrow schema from the generated Parquet schema.
With this approach, we could keep the other parts unchanged. WDYT?
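A minimal sketch of what that suggestion could look like on the Rust side (illustrative only, not code from this PR or from Iceberg Java; it assumes fallback IDs are carried in the `PARQUET:field_id` Arrow field-metadata key and assigns them only to top-level fields):

```rust
use std::sync::Arc;

use arrow_schema::Schema;

// Metadata key commonly used to carry Parquet field IDs on Arrow fields.
const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Illustration only: assign sequential fallback field IDs (1-based) to the
/// top-level fields of an Arrow schema that has none, in the spirit of
/// Java's ParquetSchemaUtil fallback-ID assignment.
fn with_fallback_field_ids(schema: &Schema) -> Schema {
    let fields: Vec<_> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(pos, field)| {
            let mut metadata = field.metadata().clone();
            metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), (pos + 1).to_string());
            Arc::new(field.as_ref().clone().with_metadata(metadata))
        })
        .collect();
    Schema::new(fields)
}
```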
Thanks for the comments @liurenjie1024! Let me take a look today to address your feedback.
Thanks again @liurenjie1024! Please let me know if these changes reflect what you had in mind. I'm hoping this design where we pass a custom schema to …
What issue does this PR close?
Partially addresses #1749.
Rationale for this change
Background: This issue was discovered when running Iceberg Java's test suite against our experimental DataFusion Comet branch that uses iceberg-rust. Many failures occurred in TestMigrateTableAction.java, which tests reading Parquet files from migrated tables (e.g., from Hive or Spark) that lack embedded field ID metadata.

Problem: The Rust ArrowReader was unable to read these files, while Iceberg Java handles them using a position-based fallback where top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including nested content) are projected.
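As a concrete (hypothetical) illustration of that fallback rule:

```rust
// Hypothetical example: a migrated file with top-level columns (id, name, ts)
// and an Iceberg schema `1: id, 2: name, 3: ts`. With no field IDs in the
// file, top-level field ID N falls back to top-level column position N - 1.
fn fallback_column_position(field_id: usize) -> usize {
    field_id - 1
}

fn main() {
    assert_eq!(fallback_column_position(1), 0); // id
    assert_eq!(fallback_column_position(2), 1); // name
    assert_eq!(fallback_column_position(3), 2); // ts
}
```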
What changes are included in this PR?
This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.
Solution: Implemented fallback projection in ArrowReader::get_arrow_projection_mask_fallback() that matches Java's ParquetSchemaUtil.pruneColumnsFallback() behavior:
- Uses ProjectionMask::roots() to project entire top-level columns, including nested content (structs, lists, maps)
- Relies on the existing RecordBatchTransformer for schema evolution (missing columns are filled with NULLs by the RecordBatchTransformer)

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows.
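For reference, a minimal sketch of the projection idea using the parquet crate (assumptions: a ParquetRecordBatchReaderBuilder is already open, every selected field ID has a corresponding column in the file, and this is not the PR's exact code):

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ProjectionMask;

// Sketch of the fallback projection: when the file has no field IDs, map each
// selected top-level Iceberg field ID N to root column position N - 1 and
// project whole top-level columns (nested content comes along with the root).
fn fallback_projection_mask(
    builder: &ParquetRecordBatchReaderBuilder<File>,
    selected_field_ids: &[i32],
) -> ProjectionMask {
    let root_positions: Vec<usize> = selected_field_ids
        .iter()
        .map(|id| *id as usize - 1) // field ID N -> top-level column N - 1
        .collect();
    // `roots` selects entire top-level columns, including structs, lists, maps.
    ProjectionMask::roots(builder.parquet_schema(), root_positions)
}
```

This sketch omits the schema-evolution case: per the description above, field IDs with no matching column are left to the RecordBatchTransformer, which fills them with NULLs.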
Are these changes tested?
Yes, comprehensive unit tests were added to verify the fallback path works correctly:
- test_read_parquet_file_without_field_ids - Basic projection with primitive columns using position-based mapping
- test_read_parquet_without_field_ids_partial_projection - Project a subset of columns
- test_read_parquet_without_field_ids_schema_evolution - Handle missing columns with NULL values
- test_read_parquet_without_field_ids_multiple_row_groups - Verify behavior across row group boundaries
- test_read_parquet_without_field_ids_with_struct - Project structs with nested fields (entire top-level column)
- test_read_parquet_without_field_ids_filter_eliminates_all_rows - Comet saw a panic when all row groups were filtered out; this reproduces that scenario
- test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle - Schema evolution with a column added in the middle caused a panic at one point

All tests verify that the behavior matches Iceberg Java's pruneColumnsFallback() implementation in /parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java.