[WIP] Add new dask_cudf.read_parquet API
#17250
Draft
+727 −191
Description
It's time to clean up the `dask_cudf.read_parquet` API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own `read_parquet` API within Dask cuDF.
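As a point of reference, here is a minimal sketch of what calling the dedicated API looks like. The file path is hypothetical; the call itself just uses the public `dask_cudf.read_parquet` entry point.

```python
import dask_cudf

# Read a (hypothetical) parquet dataset into a cudf-backed Dask DataFrame.
# With query-planning enabled (the default), this dispatches to the new
# GPU-specific reader described in the notes below.
ddf = dask_cudf.read_parquet("dataset/*.parquet")
print(ddf.head())
```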
Notes:

- The new `dask_cudf.read_parquet` API is only relevant when query-planning is enabled (the default).
- `filesystem="arrow"` now uses `cudf.read_parquet` when reading from local storage (rather than PyArrow).
- The `blocksize` argument is now specific to the size of the first NVIDIA device on the client machine. More specifically, we use `pynvml` and set `blocksize` to 1/32 of the total memory of device 0 (see the sketch after this list).
- When `blocksize` is `None`, we disable partition fusion at optimization time.
- When `blocksize` is not `None`, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio).
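The sketch below illustrates the two `blocksize` points above: how a device-relative value can be derived with `pynvml` (1/32 of device 0's total memory) and how `blocksize=None` opts out of partition fusion. The `pynvml` calls are standard, but the derivation shown is an assumption based on the note above, not the exact code in this PR.

```python
import pynvml
import dask_cudf

# Derive a device-relative blocksize: 1/32 of the total memory reported
# for the first NVIDIA device (device 0), per the note above (assumed logic).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).total
pynvml.nvmlShutdown()
blocksize = total_bytes // 32

# Explicit blocksize: parquet metadata from the first few files informs
# partition fusion at optimization time.
ddf = dask_cudf.read_parquet("dataset/*.parquet", blocksize=blocksize)

# blocksize=None: partition fusion is disabled at optimization time.
ddf_unfused = dask_cudf.read_parquet("dataset/*.parquet", blocksize=None)
```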
Checklist