Skip to content

Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf#9304

Merged
rapids-bot[bot] merged 34 commits intorapidsai:branch-21.12from
rjzamora:native-file-simple
Oct 4, 2021
Merged

Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf#9304
rapids-bot[bot] merged 34 commits intorapidsai:branch-21.12from
rjzamora:native-file-simple

Conversation

@rjzamora
Copy link
Member

@rjzamora rjzamora commented Sep 24, 2021

This PR implements a simple but critical subset of the the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

What this PR DOES do:

  • Enables users to pass Arrow-based file objects directly to the cudf read_parquet and read_csv functions. For example:
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)
  • Adds automatic conversion of fsspec AbstractBufferedFile objects into Arrow-backed PythonFile objects. For read_parquet, an Arrow-backed PythonFile object can be used (in place of an optimized fsspec transfer) by passing use_python_file_object=True:
import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)

or

import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)

What this PR does NOT do:

  • cudf will not automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
  • cudf will not accept NativeFile input for IO functions other than read_csv and read_parquet
  • dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects - Those changes only apply to direct cudf usage

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. Cython Performance Performance related issue improvement Improvement / enhancement to an existing function labels Sep 24, 2021
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ approval

Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default constructor is not safe.

@github-actions github-actions bot removed the libcudf Affects libcudf (C++/CUDA) code. label Oct 4, 2021
@rjzamora rjzamora removed the 4 - Needs Review Waiting for reviewer to review or respond label Oct 4, 2021
@rjzamora rjzamora requested a review from jrhemstad October 4, 2021 20:44
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Oct 4, 2021
@rjzamora
Copy link
Member Author

rjzamora commented Oct 4, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fb18491 into rapidsai:branch-21.12 Oct 4, 2021
@rjzamora rjzamora deleted the native-file-simple branch October 4, 2021 21:44
rapids-bot bot pushed a commit that referenced this pull request Oct 6, 2021
This is a simple follow-up to #9304 and #9265 meant to achieve the following:

- After this PR, the default behavior of `cudf.read_csv` will be to convert fsspec-based `AbstractBufferedFile` objects to Arrow `PythonFile` objects for non-local file systems. Since `PythonFile` objects inherit from `NativeFile` objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory (i.e. the default behavior enables proper partial IO from remote storage)

- #9265 recently added an fsspec-based optimization for transfering csv byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific `byte_range` is specified to the `cudf.read_csv` call.  However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9376
rapids-bot bot pushed a commit that referenced this pull request Oct 7, 2021
This is a follow-up to #9304, and is more-or-less the ORC version of #9376

These changes will enable partial IO to behave "correctly" for `cudf.read_orc` from remote storage. Simpe multi-stripe file example:

```python
# After this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 579 ms, sys: 166 ms, total: 744 ms
Wall time: 2.38 s

# Before this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 3.9 s, sys: 1.47 s, total: 5.37 s
Wall time: 8.5 s
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9377
@rjzamora rjzamora mentioned this pull request Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants