[WIP] Add new dask_cudf.read_parquet API #17250

Draft: rjzamora wants to merge 8 commits into branch-24.12

@rjzamora (Member) commented Nov 5, 2024

Description

It's time to clean up the dask_cudf.read_parquet API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own read_parquet API within Dask cuDF.

Notes:

  • The "new" dask_cudf.read_parquet API is only relevant when query-planning is enabled (the default).
  • Passing filesystem="arrow" now reads with cudf.read_parquet (rather than PyArrow) when the data is on local storage.
  • (specific to Dask cuDF): The default blocksize now depends on the memory size of the first NVIDIA device on the client machine. More specifically, we use pynvml to query device 0 and set blocksize to 1/32 of its total memory.
  • (specific to Dask cuDF): When blocksize is None, we disable partition fusion at optimization time.
  • (specific to Dask cuDF): When blocksize is not None, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio).
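As a rough illustration of the default-blocksize note above, the following sketch computes 1/32 of a device's total memory. The helper names `blocksize_from_device_memory` and `query_device0_total_memory` are hypothetical and are not taken from this PR; the actual implementation may round, clamp, or cache the value differently. Only the pynvml calls shown (`nvmlInit`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`, `nvmlShutdown`) are existing library functions.

```python
def blocksize_from_device_memory(total_bytes: int, divisor: int = 32) -> int:
    """Return a blocksize equal to 1/divisor of the device's total memory.

    Hypothetical helper: the 1/32 ratio comes from the PR description,
    everything else here is an assumption made for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_bytes // divisor


def query_device0_total_memory() -> int:
    """Query the total memory (in bytes) of NVIDIA device 0 via pynvml.

    Hypothetical helper; requires pynvml and an NVIDIA driver at call time.
    """
    # Imported lazily so the pure helper above works on machines without a GPU.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
    finally:
        pynvml.nvmlShutdown()
```

On a machine with a GPU, `blocksize_from_device_memory(query_device0_total_memory())` would give a value that could be passed explicitly as `blocksize=` to `dask_cudf.read_parquet`, while passing `blocksize=None` corresponds to the case above where partition fusion is disabled.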

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the "2 - In Progress" (Currently a work in progress) label Nov 5, 2024
@rjzamora self-assigned this Nov 5, 2024
@github-actions bot added the "Python" (Affects Python cuDF API.) label Nov 5, 2024
@rjzamora added the "improvement" (Improvement / enhancement to an existing function) and "non-breaking" (Non-breaking change) labels Nov 6, 2024
Labels
2 - In Progress (Currently a work in progress), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Python (Affects Python cuDF API.)
Projects
Status: In Progress
1 participant