[FEA] Add option to use experimental parquet reader in cudf-polars #20232

@GregoryKimball

Description

Is your feature request related to a problem? Please describe.
As of 25.12, cudf-polars uses the chunked parquet reader in libcudf for its parquet TableScan.
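
For reference, the current path looks roughly like the following pylibcudf sketch; the file name is illustrative and the options-builder details may differ slightly across versions:

```python
# Rough sketch of the current chunked-reader path via pylibcudf.
# The file name is illustrative; option names follow the 25.x API.
import pylibcudf as plc

options = plc.io.parquet.ParquetReaderOptions.builder(
    plc.io.SourceInfo(["lineitem.parquet"])
).build()

# All IO and decompression/decode happen behind read_chunk(); the caller
# only sets memory limits, with no control over the individual steps.
reader = plc.io.parquet.ChunkedParquetReader(
    options,
    chunk_read_limit=0,  # 0 = no limit on output chunk size
    pass_read_limit=0,   # 0 = no limit on intermediate decompression memory
)
while reader.has_next():
    tbl_with_meta = reader.read_chunk()
    # ...hand each chunk to the rest of the query plan...
```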

#17896 tracks the development of a new parquet reader API that includes:

  • fine-grained control of the reader steps, rather than completing all IO and compute behind one API call
  • stateless APIs to provide easier retry options
  • new filtering options with page stats and dictionary page inspection
  • two-stage materialization of filter columns and payload columns

There is a new example of the reader in action over in #19469.
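
To make the two-stage flow concrete, here is a loose sketch of how a TableScan might drive the new reader from Python. None of these bindings exist yet; the names are hypothetical stand-ins modeled on the C++ APIs tracked in #17896 and demonstrated in #19469:

```python
# Hypothetical sketch: pylibcudf bindings for the hybrid_scan reader do
# not exist yet. Names mirror the C++ APIs tracked in #17896.
reader = hybrid_scan_reader(footer_buffer, options)  # stateless: built from footer bytes

# Prune row groups up front using footer stats, and optionally page
# stats and dictionary pages, against the scan predicate.
row_groups = reader.filter_row_groups(predicate)

# Stage 1: materialize only the filter columns and evaluate the
# predicate to produce a row mask.
filter_table, row_mask = reader.materialize_filter_columns_chunk(row_groups, predicate)

# Stage 2: materialize payload columns only for rows that survived.
payload_table = reader.materialize_payload_columns_chunk(row_groups, row_mask)
```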

Describe the solution you'd like
cudf-polars could add an optional TableScan implementation that uses the new APIs. Over time, cudf-polars could take advantage of new features in the reader.

Project scope and status

• Add pylibcudf bindings for cudf::io::experimental::hybrid_scan_reader. Status: may need some development for pylibcudf to handle all of the API patterns in hybrid_scan.cpp (needs triage).
• Add a basic cudf-polars implementation that uses the hybrid_scan APIs. Status: could still match the current inputs and outputs, but use materialize_filter_columns_chunk and materialize_payload_columns_chunk instead of read_chunk. cudf-polars would need to use the KvikIO Python API to prepare compressed parquet buffers for the new reader (see the KvikIO sketch after this list).
• Add a switch to toggle between the two readers. Status: ideally this reader selection would be exposed as a PDSH CLI parameter; an environment variable could also work (see the toggle sketch after this list).
• Add an option to use two-stage or single-stage materialization. Status: either materialize the filter columns first and then compute how many payload row groups and pages are needed, or materialize all of the row groups in one pass.
• Add an option to use the column index for page pruning. Status: initial testing shows it is expensive to compute per-page predicates, but it may be worthwhile depending on query selectivity, page alignment and size, and how much work we do to speed up per-page predicate computation.
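
For the second item, a minimal sketch of preparing compressed parquet buffers with KvikIO's Python API is below. kvikio.CuFile is a real API; the footer-locating logic follows the parquet footer layout (a 4-byte little-endian length followed by the "PAR1" magic), and the handoff to the new reader is left hypothetical:

```python
# Sketch: read parquet byte ranges directly into device memory with KvikIO.
import os
import cupy as cp
import kvikio

def read_byte_range(path: str, offset: int, size: int) -> cp.ndarray:
    """Read `size` bytes at `offset` into a device buffer via cuFile."""
    buf = cp.empty(size, dtype=cp.uint8)
    with kvikio.CuFile(path, "r") as f:
        f.pread(buf, size, offset).get()  # asynchronous read; .get() waits
    return buf

path = "lineitem.parquet"  # illustrative
file_size = os.path.getsize(path)

# Parquet footer layout: ...metadata | 4-byte little-endian length | "PAR1".
tail = read_byte_range(path, file_size - 8, 8).get()  # copy the last 8 bytes to host
footer_len = int(tail[:4].view("<u4")[0])
footer_buf = read_byte_range(path, file_size - 8 - footer_len, footer_len)
# footer_buf (and later the row-group byte ranges the reader requests)
# would then be handed to the hybrid_scan reader.
```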
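
And for the reader toggle, a minimal sketch of an environment-variable switch is below; the variable name and the two dispatch targets are hypothetical placeholders, not existing settings:

```python
# Sketch: select between the two readers with an environment variable.
# CUDF_POLARS_USE_HYBRID_SCAN is a hypothetical name, not an existing setting.
import os

def use_hybrid_scan() -> bool:
    return os.environ.get("CUDF_POLARS_USE_HYBRID_SCAN", "0") == "1"

def parquet_table_scan(paths, columns, predicate):
    if use_hybrid_scan():
        return hybrid_scan_table_scan(paths, columns, predicate)  # new reader path
    return chunked_reader_table_scan(paths, columns, predicate)   # current path
```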

Additional context
There is also integration work to test the new reader APIs in Velox-cuDF and Spark-RAPIDS.
