[FEA] Add option to use experimental parquet reader in cudf-polars #20232

@GregoryKimball

Description

Is your feature request related to a problem? Please describe.
As of 25.12, cudf-polars uses the chunked parquet reader in libcudf for its parquet TableScan.
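
For reference, the current path looks roughly like the following pylibcudf sketch; the file name is illustrative and the options-builder details may differ slightly across versions:

```python
# Rough sketch of the current chunked-reader path via pylibcudf.
# The file name is illustrative; option names follow the 25.x API.
import pylibcudf as plc

options = plc.io.parquet.ParquetReaderOptions.builder(
    plc.io.SourceInfo(["lineitem.parquet"])
).build()

# All IO and decompression/decode happen behind read_chunk(); the caller
# only sets memory limits, with no control over the individual steps.
reader = plc.io.parquet.ChunkedParquetReader(
    options,
    chunk_read_limit=0,  # 0 = no limit on output chunk size
    pass_read_limit=0,   # 0 = no limit on intermediate decompression memory
)
while reader.has_next():
    tbl_with_meta = reader.read_chunk()
    # ...hand each chunk to the rest of the query plan...
```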

#17896 tracks the development of a new parquet reader API that includes:

  • fine-grained control of the reader steps, rather than completing all IO and compute behind one API call
  • stateless APIs to provide easier retry options
  • new filtering options with page stats and dictionary page inspection
  • two-stage materialization of filter columns and payload columns

There is a new example of the reader in action over in #19469.
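
To make the two-stage flow concrete, here is a loose sketch of how a TableScan might drive the new reader from Python. None of these bindings exist yet; the names are hypothetical stand-ins modeled on the C++ APIs tracked in #17896 and demonstrated in #19469:

```python
# Hypothetical sketch: pylibcudf bindings for the hybrid_scan reader do
# not exist yet. Names mirror the C++ APIs tracked in #17896.
reader = hybrid_scan_reader(footer_buffer, options)  # stateless: built from footer bytes

# Prune row groups up front using footer stats, and optionally page
# stats and dictionary pages, against the scan predicate.
row_groups = reader.filter_row_groups(predicate)

# Stage 1: materialize only the filter columns and evaluate the
# predicate to produce a row mask.
filter_table, row_mask = reader.materialize_filter_columns_chunk(row_groups, predicate)

# Stage 2: materialize payload columns only for rows that survived.
payload_table = reader.materialize_payload_columns_chunk(row_groups, row_mask)
```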

Describe the solution you'd like
cudf-polars could add an optional TableScan implementation that uses the new APIs. Over time, cudf-polars could take advantage of new features in the reader.

Project scope and status

• Add pylibcudf bindings for cudf::io::experimental::hybrid_scan_reader. Status: may need some development for pylibcudf to handle all of the API patterns in hybrid_scan.cpp (needs triage).
• Add a basic cudf-polars implementation that uses the hybrid_scan APIs. Status: could still match the current inputs and outputs, but use materialize_filter_columns_chunk and materialize_payload_columns_chunk instead of read_chunk. cudf-polars would need to use the KvikIO Python API to prepare compressed parquet buffers for the new reader (see the KvikIO sketch after this list).
• Add a switch to toggle between the two readers. Status: ideally this reader selection would be exposed as a PDSH CLI parameter; an environment variable could also work (see the toggle sketch after this list).
• Add an option to use two-stage or single-stage materialization. Status: either materialize the filter columns first and then compute how many payload row groups and pages are needed, or materialize all of the row groups in one pass.
• Add an option to use the column index for page pruning. Status: initial testing shows it is expensive to compute per-page predicates, but it may be worthwhile depending on query selectivity, page alignment and size, and how much work we do to speed up per-page predicate computation.
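
For the second item, a minimal sketch of preparing compressed parquet buffers with KvikIO's Python API is below. kvikio.CuFile is a real API; the footer-locating logic follows the parquet footer layout (a 4-byte little-endian length followed by the "PAR1" magic), and the handoff to the new reader is left hypothetical:

```python
# Sketch: read parquet byte ranges directly into device memory with KvikIO.
import os
import cupy as cp
import kvikio

def read_byte_range(path: str, offset: int, size: int) -> cp.ndarray:
    """Read `size` bytes at `offset` into a device buffer via cuFile."""
    buf = cp.empty(size, dtype=cp.uint8)
    with kvikio.CuFile(path, "r") as f:
        f.pread(buf, size, offset).get()  # asynchronous read; .get() waits
    return buf

path = "lineitem.parquet"  # illustrative
file_size = os.path.getsize(path)

# Parquet footer layout: ...metadata | 4-byte little-endian length | "PAR1".
tail = read_byte_range(path, file_size - 8, 8).get()  # copy the last 8 bytes to host
footer_len = int(tail[:4].view("<u4")[0])
footer_buf = read_byte_range(path, file_size - 8 - footer_len, footer_len)
# footer_buf (and later the row-group byte ranges the reader requests)
# would then be handed to the hybrid_scan reader.
```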
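
And for the reader toggle, a minimal sketch of an environment-variable switch is below; the variable name and the two dispatch targets are hypothetical placeholders, not existing settings:

```python
# Sketch: select between the two readers with an environment variable.
# CUDF_POLARS_USE_HYBRID_SCAN is a hypothetical name, not an existing setting.
import os

def use_hybrid_scan() -> bool:
    return os.environ.get("CUDF_POLARS_USE_HYBRID_SCAN", "0") == "1"

def parquet_table_scan(paths, columns, predicate):
    if use_hybrid_scan():
        return hybrid_scan_table_scan(paths, columns, predicate)  # new reader path
    return chunked_reader_table_scan(paths, columns, predicate)   # current path
```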

Additional context
There is also integration work to test the new reader APIs in Velox-cuDF and Spark-RAPIDS.
