Commit aa8c0c4

Implement cudf-polars chunked parquet reading (#16944)
This PR provides access to the libcudf chunked parquet reader through the `cudf-polars` GPU engine, inspired by the cuDF Python implementation.

Closes #16818

Authors:
- https://github.com/brandon-b-miller
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Lawrence Mitchell (https://github.com/wence-)

URL: #16944
1 parent d475dca commit aa8c0c4

11 files changed (+297 -66 lines)

@@ -0,0 +1,25 @@ (new file)

# GPUEngine Configuration Options

The `polars.GPUEngine` object may be configured in several different ways.

## Parquet Reader Options

Reading large parquet files can use a large amount of memory, especially when the files are compressed. This may lead to out-of-memory errors for some workflows. To mitigate this, the "chunked" parquet reader may be selected. When enabled, parquet files are read in chunks, limiting peak memory usage at the cost of a small drop in performance.

To configure the parquet reader, pass a dictionary of options to the `parquet_options` keyword of the `GPUEngine` object. Valid keys and values are:

- `chunked` indicates that chunked parquet reading is to be used. By default, chunked reading is turned on.
- [`chunk_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum size per chunk. By default, the maximum chunk size is unlimited.
- [`pass_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum memory used for decompression. The default pass read limit is 16GiB.

For example, to select the chunked reader with custom values for `pass_read_limit` and `chunk_read_limit`:

```python
engine = GPUEngine(
    parquet_options={
        'chunked': True,
        'chunk_read_limit': int(1e9),
        'pass_read_limit': int(4e9)
    }
)
result = query.collect(engine=engine)
```
Note that passing `chunked: False` disables chunked reading entirely, and thus `chunk_read_limit` and `pass_read_limit` will have no effect.
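For completeness, a minimal sketch of that disabled configuration, reusing the `query` object from the example above:

```python
engine = GPUEngine(
    parquet_options={
        # Chunked reading off: chunk_read_limit and pass_read_limit
        # would be ignored even if supplied here.
        'chunked': False,
    }
)
result = query.collect(engine=engine)
```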

docs/cudf/source/cudf_polars/index.rst

+6 lines

@@ -39,3 +39,9 @@ Launch on Google Colab
    :target: https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb
 
 Try out the GPU engine for Polars in a free GPU notebook environment. Sign in with your Google account and `launch the demo on Colab <https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/accelerated_data_processing_examples/polars_gpu_engine_demo.ipynb>`__.
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Engine Config Options:
+
+   engine_options

python/cudf_polars/cudf_polars/callback.py

+34 -6 lines

@@ -129,6 +129,7 @@ def set_device(device: int | None) -> Generator[int, None, None]:
 
 def _callback(
     ir: IR,
+    config: GPUEngine,
     with_columns: list[str] | None,
     pyarrow_predicate: str | None,
     n_rows: int | None,
@@ -145,7 +146,30 @@ def _callback(
         set_device(device),
         set_memory_resource(memory_resource),
     ):
-        return ir.evaluate(cache={}).to_polars()
+        return ir.evaluate(cache={}, config=config).to_polars()
+
+
+def validate_config_options(config: dict) -> None:
+    """
+    Validate the configuration options for the GPU engine.
+
+    Parameters
+    ----------
+    config
+        Configuration options to validate.
+
+    Raises
+    ------
+    ValueError
+        If the configuration contains unsupported options.
+    """
+    if unsupported := (config.keys() - {"raise_on_fail", "parquet_options"}):
+        raise ValueError(
+            f"Engine configuration contains unsupported settings: {unsupported}"
+        )
+    assert {"chunked", "chunk_read_limit", "pass_read_limit"}.issuperset(
+        config.get("parquet_options", {})
+    )
 
 
 def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
@@ -174,10 +198,8 @@ def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
     device = config.device
     memory_resource = config.memory_resource
     raise_on_fail = config.config.get("raise_on_fail", False)
-    if unsupported := (config.config.keys() - {"raise_on_fail"}):
-        raise ValueError(
-            f"Engine configuration contains unsupported settings {unsupported}"
-        )
+    validate_config_options(config.config)
+
     with nvtx.annotate(message="ConvertIR", domain="cudf_polars"):
         translator = Translator(nt)
         ir = translator.translate_ir()
@@ -200,5 +222,11 @@ def execute_with_cudf(nt: NodeTraverser, *, config: GPUEngine) -> None:
                 raise exception
         else:
             nt.set_udf(
-                partial(_callback, ir, device=device, memory_resource=memory_resource)
+                partial(
+                    _callback,
+                    ir,
+                    config,
+                    device=device,
+                    memory_resource=memory_resource,
+                )
             )
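As a quick illustration of the validation added in this diff, the sketch below shows which engine configurations `validate_config_options` accepts and which it rejects. The configuration dictionaries are made-up examples; the function and its import path follow from the file shown above.

```python
from cudf_polars.callback import validate_config_options

# Accepted: "raise_on_fail" and "parquet_options" are the only recognized
# top-level keys, and the nested parquet options are all known.
validate_config_options(
    {
        "raise_on_fail": True,
        "parquet_options": {"chunked": True, "pass_read_limit": int(4e9)},
    }
)

# Rejected: parquet settings must be nested under "parquet_options";
# an unknown top-level key raises ValueError.
try:
    validate_config_options({"chunked": True})
except ValueError as err:
    print(err)  # Engine configuration contains unsupported settings: {'chunked'}
```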
