ParquetOpener fails on files without PageIndex metadata #19839

@friendlymatthew

Describe the bug

The ParquetOpener errors when reading Parquet files that lack page index metadata during sparse column chunk reads with row selection masks :neckbeard:. This manifests as "Invalid offset in sparse column chunk data: [offset], no matching page found" errors.

I think the main problem here is using ArrowReaderOptions::with_page_index(true), which internally sets PageIndexPolicy::Required and strictly requires page index metadata to be present. This API was replaced in arrow land with the more flexible PageIndexPolicy enum, which expands behavior from 2 boolean states to 3 policy options (Required, Optional, Skip).

Related issues

Expected behavior

We should set the page index policy to PageIndexPolicy::Optional. This way the reader gracefully handles files both with and without page index metadata.

// Since we're manually loading the page index the option here should not matter but we pass it in for consistency
options.with_page_index(true),
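
To make the difference between the policies concrete, here is a minimal sketch of the semantics described above. It does not use the parquet crate: `Policy` and `resolve_page_index` are hypothetical stand-ins written only for illustration, and the boolean mapping reflects my reading of how with_page_index(true) ends up as Required.

```rust
// Hypothetical stand-in for the three-state page index policy.
// This is NOT the parquet crate's type; it only models the behavior
// discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Policy {
    /// Never read the page index.
    Skip,
    /// Use the page index when the file has one, otherwise carry on.
    Optional,
    /// Fail if the file has no page index.
    Required,
}

impl From<bool> for Policy {
    // Models the legacy boolean flag: `true` becomes the strict policy.
    fn from(enabled: bool) -> Self {
        if enabled {
            Policy::Required
        } else {
            Policy::Skip
        }
    }
}

// Returns Ok(true) when the page index is used, Ok(false) when the read
// proceeds without it, and Err when the policy demands an index that the
// file does not have.
fn resolve_page_index(policy: Policy, file_has_index: bool) -> Result<bool, String> {
    match (policy, file_has_index) {
        (Policy::Skip, _) => Ok(false),
        (Policy::Optional, has_index) => Ok(has_index),
        (Policy::Required, true) => Ok(true),
        (Policy::Required, false) => {
            Err("page index required but missing from file".to_string())
        }
    }
}
```

In this model, `resolve_page_index(Policy::from(true), false)` is an error, which mirrors the failure on files without page index metadata, while `resolve_page_index(Policy::Optional, false)` succeeds and simply reads without the index.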
