Labels: feature request (New feature or request)
Description
Is your feature request related to a problem? Please describe.
As of 25.12, cudf-polars uses the chunked parquet reader in libcudf for parquet TableScan.
#17896 tracks the development of a new parquet reader API that includes:
- fine-grained control of the reader steps, rather than completing all IO and compute behind one API
- stateless APIs to provide easier retry options
- new filtering options with page stats and dictionary page inspection
- two-stage materialization of filter columns and payload columns
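The retry benefit of a stateless chunk API can be sketched in plain Python (the function names here are hypothetical, not the actual libcudf or pylibcudf API):

```python
# Hypothetical sketch: because a stateless API holds no partial reader
# state, a failed chunk can simply be retried with a smaller chunk size.
def read_with_retry(read_chunk, rows, chunk_rows, min_chunk_rows=1024):
    """Read `rows` total rows via `read_chunk(start, n)`, halving the
    chunk size whenever a chunk fails (e.g. out-of-device-memory)."""
    out = []
    start = 0
    while start < rows:
        n = min(chunk_rows, rows - start)
        try:
            out.append(read_chunk(start, n))
            start += n
        except MemoryError:
            if chunk_rows <= min_chunk_rows:
                raise  # cannot shrink further; give up
            chunk_rows //= 2  # retry the same range with a smaller chunk
    return out
```

With a stateful reader, the same retry would require tearing down and rebuilding the reader; here the failed range is simply re-requested.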
There is a new example of the reader in action in #19469.
Describe the solution you'd like
cudf-polars could add an optional TableScan implementation that uses the new APIs. Over time, cudf-polars could take advantage of new features in the reader.
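The two-stage idea can be illustrated with a toy sketch in plain Python (this is not the hybrid_scan API; names and structure are assumptions for illustration only):

```python
# Toy illustration of two-stage materialization: decode only the filter
# column first, evaluate the predicate per row group, then decode payload
# columns for surviving row groups only.
def two_stage_scan(row_groups, filter_col, payload_cols, predicate):
    # Stage 1: materialize the filter column and prune row groups whose
    # values cannot satisfy the predicate.
    surviving = [
        i for i, rg in enumerate(row_groups)
        if any(predicate(v) for v in rg[filter_col])
    ]
    # Stage 2: materialize payload columns only for surviving groups
    # (the real reader also applies row-level filtering; this sketch
    # stops at row-group granularity).
    return [
        {c: row_groups[i][c] for c in payload_cols} for i in surviving
    ]
```

The payoff is that payload columns in pruned row groups are never decompressed or decoded at all.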
| Project | Scope | Status |
|---|---|---|
| Add pylibcudf bindings for `cudf::io::experimental::hybrid_scan_reader` | May need some development for pylibcudf to handle all of the API patterns in `hybrid_scan.cpp`. | (needs triage) |
| Add basic cudf-polars implementation that uses hybrid_scan APIs | Could still match the inputs and outputs, but use `materialize_filter_columns_chunk` and `materialize_payload_columns_chunk` instead of `read_chunk`. cudf-polars would need to use the KvikIO Python API to prepare compressed parquet buffers for the new reader. | |
| Add switch to toggle between the two readers | Ideally this reader selection could be exposed as a PDSH CLI parameter; an environment variable could also work. | |
| Add option to use two-stage or single-stage materialization | Materialize filter columns first and then compute how many payload row groups and pages are needed? Or materialize all of the row groups in one pass? | |
| Add option to use column index for page pruning | Initial testing shows it is expensive to compute per-page predicates, but it may be worthwhile depending on query selectivity, page alignment and size, and how much work we do to speed up per-page predicate computation. | |
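For the reader toggle, a minimal environment-variable sketch could look like the following (the variable name `CUDF_POLARS_HYBRID_SCAN` and the dispatch function are assumptions for illustration, not an agreed-upon interface):

```python
import os

# Assumed variable name for illustration only; the actual setting (env
# var or PDSH CLI parameter) would be decided during implementation.
def use_hybrid_scan() -> bool:
    value = os.environ.get("CUDF_POLARS_HYBRID_SCAN", "0")
    return value.lower() in ("1", "true", "on")

def select_table_scan() -> str:
    # Dispatch between the two reader implementations (stub strings
    # stand in for the real TableScan implementations).
    return "hybrid_scan" if use_hybrid_scan() else "chunked_reader"
```

An environment variable keeps the toggle available to benchmark harnesses without plumbing a new option through every call site, while a PDSH CLI parameter would make it explicit per run.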
Additional context
There is also integration work to test the new reader APIs in Velox-cuDF and Spark-RAPIDS.