[WIP] Add new dask_cudf.read_parquet API
#17250
Draft
+727 −191
Description
It's time to clean up the `dask_cudf.read_parquet` API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own `read_parquet` API within Dask cuDF.
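As a point of reference, here is a minimal sketch of what calling the dedicated API looks like. The file path is hypothetical; the call itself just uses the public `dask_cudf.read_parquet` entry point.

```python
import dask_cudf

# Read a (hypothetical) parquet dataset into a cudf-backed Dask DataFrame.
# With query-planning enabled (the default), this dispatches to the new
# GPU-specific reader described in the notes below.
ddf = dask_cudf.read_parquet("dataset/*.parquet")
print(ddf.head())
```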
Notes:

- The new `dask_cudf.read_parquet` API is only relevant when query-planning is enabled (the default).
- `filesystem="arrow"` now uses `cudf.read_parquet` when reading from local storage (rather than PyArrow).
- The `blocksize` argument is now specific to the size of the first NVIDIA device on the client machine. More specifically, we use `pynvml` and set `blocksize` to 1/32 of the total memory of device 0 (see the sketch after this list).
- When `blocksize` is `None`, we disable partition fusion at optimization time.
- When `blocksize` is not `None`, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio).
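The sketch below illustrates the two `blocksize` points above: how a device-relative value can be derived with `pynvml` (1/32 of device 0's total memory) and how `blocksize=None` opts out of partition fusion. The `pynvml` calls are standard, but the derivation shown is an assumption based on the note above, not the exact code in this PR.

```python
import pynvml
import dask_cudf

# Derive a device-relative blocksize: 1/32 of the total memory reported
# for the first NVIDIA device (device 0), per the note above (assumed logic).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).total
pynvml.nvmlShutdown()
blocksize = total_bytes // 32

# Explicit blocksize: parquet metadata from the first few files informs
# partition fusion at optimization time.
ddf = dask_cudf.read_parquet("dataset/*.parquet", blocksize=blocksize)

# blocksize=None: partition fusion is disabled at optimization time.
ddf_unfused = dask_cudf.read_parquet("dataset/*.parquet", blocksize=None)
```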
Checklist