@aviralgarg05

Which issue does this PR close?

Rationale for this change

The ParquetOpener was using ArrowReaderOptions::with_page_index(true), which internally sets PageIndexPolicy::Required. This caused sparse column chunk reads with row selection masks to fail with errors like "Invalid offset in sparse column chunk data" when reading Parquet files that lack page index metadata.

Relaxing this policy to PageIndexPolicy::Optional allows DataFusion to gracefully handle files both with and without page index metadata while still leveraging the index when it exists.
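To illustrate the difference between the two policies, here is a minimal, self-contained sketch. It is a toy model only: the `FileFooter`, `PageIndex`, and `load_page_index` names are hypothetical stand-ins and the local `PageIndexPolicy` enum merely mirrors the idea, not the parquet crate's actual types or code paths. It also simplifies the failure mode; in the real reader the problem surfaced as the offset error quoted above rather than as a direct "missing index" error.

```rust
// Toy model only: hypothetical stand-ins, not the parquet crate's API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PageIndexPolicy {
    /// Never load the page index.
    Skip,
    /// Load the page index when present; carry on without it otherwise.
    Optional,
    /// Treat a missing page index as an error.
    Required,
}

/// Hypothetical page index: byte offsets of the pages in a column chunk.
#[derive(Debug, PartialEq)]
struct PageIndex {
    page_offsets: Vec<u64>,
}

/// Hypothetical file footer: the writer may or may not have emitted a page index.
struct FileFooter {
    page_index: Option<PageIndex>,
}

/// Resolve the page index for a file according to the chosen policy.
fn load_page_index(
    footer: &FileFooter,
    policy: PageIndexPolicy,
) -> Result<Option<&PageIndex>, String> {
    match (policy, footer.page_index.as_ref()) {
        (PageIndexPolicy::Skip, _) => Ok(None),
        (_, Some(index)) => Ok(Some(index)),
        (PageIndexPolicy::Optional, None) => Ok(None),
        (PageIndexPolicy::Required, None) => {
            Err("page index required but not present in file".to_string())
        }
    }
}

fn main() {
    let with_index = FileFooter {
        page_index: Some(PageIndex { page_offsets: vec![4, 1028] }),
    };
    let without_index = FileFooter { page_index: None };

    for policy in [PageIndexPolicy::Required, PageIndexPolicy::Optional] {
        let a = load_page_index(&with_index, policy).map(|i| i.map(|p| p.page_offsets.len()));
        let b = load_page_index(&without_index, policy).map(|i| i.map(|p| p.page_offsets.len()));
        println!("{policy:?}: file with index -> {a:?}, file without index -> {b:?}");
    }
}
```

In this model, only the `Required` policy turns a file without a page index into an error; `Optional` returns the index when it exists and `None` otherwise, which is the behaviour this PR is after.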

What changes are included in this PR?

  • Modified datafusion/datasource-parquet/src/opener.rs to use PageIndexPolicy::Optional instead of Required.
  • Added a new regression test in datafusion/core/tests/parquet/issue_19839.rs that validates reading a Parquet file written without a page index.

Are these changes tested?

Yes. I have added a dedicated regression test case:

  • datafusion/core/tests/parquet/issue_19839.rs

This test writes a Parquet file specifically without page index metadata and verifies that ParquetOpener can read it successfully when parquet_page_index_pruning is enabled.

Are there any user-facing changes?

No. This is a bug fix that improves the robustness of the Parquet reader.

@github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels on Jan 20, 2026
@aviralgarg05
Author

Resolved the issues! @kumarUjjawal

@aviralgarg05 force-pushed the fix/parquet-opener-page-index-policy branch from 91e0832 to faffff0 on January 20, 2026 10:25
As requested in PR feedback, the regression test for issue apache#19839
has been moved from a dedicated file to the existing page_pruning.rs
test file to keep related tests together.
Contributor

@alamb left a comment

Thanks @aviralgarg05, @kumarUjjawal and @martin-g

This is looking good -- I think we just need to fix the parquet-testing pin and it will be good to go


// Write parquet WITHOUT page index
// The default WriterProperties does not write page index, but we set it explicitly
// to be robust against future changes in defaults as requested by reviewers.
Contributor

👍 -- I like the comments

@aviralgarg05 force-pushed the fix/parquet-opener-page-index-policy branch from 5b1d1c6 to b8410d2 on January 23, 2026 08:11

Development

Successfully merging this pull request may close these issues.

ParquetOpener fails on files without PageIndex metadata

4 participants