Description
Describe the bug
The ParquetOpener fails when reading Parquet files that lack page index metadata during sparse column chunk reads with row selection masks.
This manifests as the error "Invalid offset in sparse column chunk data: [offset], no matching page found".
I think the main problem is the use of ArrowReaderOptions::with_page_index(true), which internally sets PageIndexPolicy::Required and therefore strictly requires page index metadata to be present. In arrow-rs, this API was replaced with the more flexible PageIndexPolicy enum, which expands the behavior from two boolean states to three policy options (Required, Optional, Never).
Related issues
Expected behavior
We should set the page index policy to PageIndexPolicy::Optional so that the reader gracefully handles files both with and without page index metadata.
datafusion/datafusion/datasource-parquet/src/opener.rs
Lines 434 to 435 in 6f92ea6
    // Since we're manually loading the page index the option here should not matter but we pass it in for consistency
    options.with_page_index(true),
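To make the difference between the policies concrete, here is a minimal, self-contained sketch of the behavior described above. It does not use the parquet crate: `PageIndexPolicy` below is a local stand-in whose variant names follow this issue's wording (the actual enum in arrow-rs may name them differently), and `FileMetadata`, `policy_from_flag`, and `resolve_page_index` are hypothetical names invented for illustration. The mapping of `false` to `Never` is also an assumption; this issue only states that `true` maps to `Required`.

```rust
/// Local stand-in for the three-state policy described in this issue.
/// Variant names follow the issue text, not necessarily the real arrow-rs enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PageIndexPolicy {
    Required,
    Optional,
    Never,
}

/// Hypothetical model of a file's metadata: only tracks whether a page index exists.
struct FileMetadata {
    has_page_index: bool,
}

/// Models the legacy boolean flag. `true -> Required` is what the issue reports;
/// `false -> Never` is an assumption made for this sketch.
fn policy_from_flag(enabled: bool) -> PageIndexPolicy {
    if enabled {
        PageIndexPolicy::Required
    } else {
        PageIndexPolicy::Never
    }
}

/// Returns Ok(true) if the page index should be used, Ok(false) if it is skipped,
/// and Err if the policy requires a page index that the file does not have.
fn resolve_page_index(policy: PageIndexPolicy, meta: &FileMetadata) -> Result<bool, String> {
    match (policy, meta.has_page_index) {
        (PageIndexPolicy::Never, _) => Ok(false),
        (PageIndexPolicy::Optional, present) => Ok(present),
        (PageIndexPolicy::Required, true) => Ok(true),
        (PageIndexPolicy::Required, false) => {
            Err("page index required but not present".to_string())
        }
    }
}
```

In this model, `resolve_page_index(policy_from_flag(true), &FileMetadata { has_page_index: false })` returns an error, while the same call with `PageIndexPolicy::Optional` returns `Ok(false)` and lets the read continue without a page index. That is the behavior requested here.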