Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf #9304

Merged
merged 34 commits into from
Oct 4, 2021

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Sep 24, 2021

This PR implements a simple but critical subset of the the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

What this PR DOES do:

  • Enables users to pass Arrow-based file objects directly to the cudf read_parquet and read_csv functions. For example:
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)
  • Adds automatic conversion of fsspec AbstractBufferedFile objects into Arrow-backed PythonFile objects. For read_parquet, an Arrow-backed PythonFile object can be used (in place of an optimized fsspec transfer) by passing use_python_file_object=True:
import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)

or

import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)

What this PR does NOT do:

  • cudf will not automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
  • cudf will not accept NativeFile input for IO functions other than read_csv and read_parquet
  • dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects - Those changes only apply to direct cudf usage

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. Cython Performance Performance related issue improvement Improvement / enhancement to an existing function labels Sep 24, 2021
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ approval

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default constructor is not safe.

@github-actions github-actions bot removed the libcudf Affects libcudf (C++/CUDA) code. label Oct 4, 2021
@rjzamora rjzamora removed the 4 - Needs Review Waiting for reviewer to review or respond label Oct 4, 2021
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Oct 4, 2021
@rjzamora
Copy link
Member Author

rjzamora commented Oct 4, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fb18491 into rapidsai:branch-21.12 Oct 4, 2021
@rjzamora rjzamora deleted the native-file-simple branch October 4, 2021 21:44
rapids-bot bot pushed a commit that referenced this pull request Oct 6, 2021
This is a simple follow-up to #9304 and #9265 meant to achieve the following:

- After this PR, the default behavior of `cudf.read_csv` will be to convert fsspec-based `AbstractBufferedFile` objects to Arrow `PythonFile` objects for non-local file systems. Since `PythonFile` objects inherit from `NativeFile` objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory (i.e. the default behavior enables proper partial IO from remote storage)

- #9265 recently added an fsspec-based optimization for transfering csv byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific `byte_range` is specified to the `cudf.read_csv` call.  However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9376
rapids-bot bot pushed a commit that referenced this pull request Oct 7, 2021
This is a follow-up to #9304, and is more-or-less the ORC version of #9376

These changes will enable partial IO to behave "correctly" for `cudf.read_orc` from remote storage. Simpe multi-stripe file example:

```python
# After this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 579 ms, sys: 166 ms, total: 744 ms
Wall time: 2.38 s

# Before this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 3.9 s, sys: 1.47 s, total: 5.37 s
Wall time: 8.5 s
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9377
@rjzamora rjzamora mentioned this pull request Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants