-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix/parquet opener page index policy #19890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix/parquet opener page index policy #19890
Conversation
|
Resolved the issues! @kumarUjjawal |
91e0832 to
faffff0
Compare
As requested in PR feedback, the regression test for issue apache#19839 has been moved from a dedicated file to the existing page_pruning.rs test file to keep related tests together.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aviralgarg05 , @kumarUjjawal and @martin-g
This is looking good -- I think we just need to fix the parquet-testing pin and it will be good to go
|
|
||
| // Write parquet WITHOUT page index | ||
| // The default WriterProperties does not write page index, but we set it explicitly | ||
| // to be robust against future changes in defaults as requested by reviewers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 -- I like the comments
5b1d1c6 to
b8410d2
Compare
Which issue does this PR close?
ParquetOpenerfails on files withoutPageIndexmetadata #19839.Rationale for this change
The ParquetOpener was using
ArrowReaderOptions::with_page_index(true), which internally setsPageIndexPolicy::Required. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.Relaxing this policy to
PageIndexPolicy::Optionalallows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.What changes are included in this PR?
PageIndexPolicy::Optionalinstead ofRequired.Are these changes tested?
Yes. I have added a dedicated regression test case:
This test writes a Parquet file specifically without page index metadata and verifies that ParquetOpener can read it successfully when
parquet_page_index_pruningis enabled.Are there any user-facing changes?
No. This is a bug fix that improves the robustness of the Parquet reader.