
read_csv for s3 data #8961

Closed · wants to merge 15 commits

Conversation

shridharathi

These updates are meant to speed up reading CSV files from Amazon S3 buckets. The CSV data is passed down to the Cython bindings, where PyArrow functions open it and convert it from a NativeFile into a shared pointer to a RandomAccessFile, then into an Arrow IO source, and finally into a Datasource.
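The user-facing entry point for that chain looks roughly like the sketch below; the bucket and file names are hypothetical, and the Cython/C++ conversion steps happen inside cudf and are not shown.

```python
# Minimal sketch of the pyarrow side of the chain described above.
# Bucket/key are placeholders. fs.open_input_file() returns a pyarrow
# NativeFile, which is the object cudf's Cython layer converts through
# arrow::io::RandomAccessFile down to a cudf Datasource.
import pyarrow.fs as pa_fs

fs = pa_fs.S3FileSystem()
native_file = fs.open_input_file("my-bucket/some-file.csv")
```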

github-actions bot added labels `Python` (Affects Python cuDF API) and `libcudf` (Affects libcudf (C++/CUDA) code) on Aug 4, 2021.
shridharathi added the `2 - In Progress` (Currently a work in progress) label on Aug 5, 2021.
Comment on lines 1162 to 1164 (Member):

```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```

This logic might need to be moved to csv.py for now, to ensure this codepath is only followed for read_csv calls. In the future, this could be implemented here when other readers support this option as well.

Comment on lines 1164 to 1167 (Member):

```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```

Can remove this, since we don't want to go down this codepath for anything that's not read_csv, and this utility is called by other readers as well.

Comment on lines 79 to 81 (Member):

```python
if _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```

Can directly use the method defined in ioutils.
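A hedged sketch of what that suggestion might look like in csv.py; the `cudf.utils.ioutils` import path is an assumption based on cudf's layout, not code from this PR:

```python
# Illustrative only: reuse the _is_s3_filesystem helper from ioutils
# instead of redefining the check locally in csv.py. Import paths are
# assumed; fs and paths come from the surrounding reader function.
import pyarrow.fs

from cudf.utils import ioutils

if ioutils._is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```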

github-actions bot added labels `CMake` (CMake build issue), `conda`, and `Java` (Affects Java cuDF API) on Aug 28, 2021.
shridharathi marked this pull request as ready for review on August 30, 2021.
shridharathi requested review from a team as code owners on August 30, 2021.
github-actions bot removed labels `Java` (Affects Java cuDF API), `CMake` (CMake build issue), and `gpuCI` on Aug 31, 2021.
Comment on lines +120 to +121 (Member):

```python
if isinstance(datasource, NativeFile):
    datasource = NativeFileDatasource(datasource)
```

Can we move this into the make_source_info utility that is called on the next line? It seems that both csv and parquet hit that same function to set up the source information, so doing something like the following in that function should take care of both formats for us:

```python
...
    elif isinstance(src[0], (Datasource, NativeFile)):
        csrc = NativeFileDatasource(src[0]) if isinstance(src[0], NativeFile) else src[0]
        return source_info(csrc.get_datasource())
...
```

rjzamora (Member) commented Sep 8, 2021:

Thanks for working on this @shridharathi - this is really nice to see!

Just some thoughts (feel free to ignore):

It makes sense to automatically convert to a pyarrow S3FileSystem. However, given that it might be tricky to preserve all the possible storage_options used to generate the original fsspec filesystem object, it may also make sense to allow the user to pass in an explicit filesystem object to use and/or to pass in a NativeFile object directly. This way, we make it the user's responsibility to define the filesystem object correctly.
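A sketch of that first option from the user's side; the bucket, key, and region are hypothetical, and read_csv accepting a NativeFile assumes the behavior this PR adds:

```python
# Illustrative: the user constructs the pyarrow filesystem explicitly (with
# whatever credentials/region options they need), so cudf never has to
# translate storage_options. Bucket/key/region are hypothetical.
import cudf
import pyarrow.fs as pa_fs

fs = pa_fs.S3FileSystem(region="us-west-2")
with fs.open_input_file("my-bucket/some-file.csv") as fil:
    df = cudf.read_csv(fil)
```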

The other option may be to perform the fsspec->arrow filesystem conversion within the get_filepath_or_buffer utility. At that point, we can inspect the storage_options specified by the user, and warn the user if a pyarrow-backed filesystem couldn't be generated.
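One way the second option could look, using pyarrow's fsspec wrapper; `PyFileSystem` and `FSSpecHandler` are real pyarrow.fs APIs, but the helper name and fallback behavior below are assumptions, not code from this PR:

```python
# Hypothetical helper for the fsspec -> arrow conversion idea: wrap the
# fsspec filesystem (built from the user's storage_options) in a pyarrow
# filesystem, warning and falling back to fsspec if that fails.
import warnings

import pyarrow.fs as pa_fs


def _to_arrow_filesystem(fsspec_fs):
    try:
        return pa_fs.PyFileSystem(pa_fs.FSSpecHandler(fsspec_fs))
    except Exception as err:
        warnings.warn(
            f"Could not generate a pyarrow-backed filesystem ({err}); "
            "falling back to fsspec."
        )
        return None
```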

rapids-bot pushed a commit that referenced this pull request on Oct 4, 2021:

…csv in cudf (#9304)

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

**What this PR DOES do**:

- Enables users to pass Arrow-based file objects directly to the cudf `read_parquet` and `read_csv` functions. For example:

```python
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)
```

- Adds automatic conversion of fsspec `AbstractBufferedFile` objects into Arrow-backed `PythonFile` objects. For `read_parquet`, an Arrow-backed `PythonFile` object can be used (in place of an optimized fsspec transfer) by passing `use_python_file_object=True`:

```python
import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)
```

or 

```python
import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)
```


**What this PR does NOT do**:

- cudf will **not** automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
- cudf will **not** accept NativeFile input for IO functions other than read_csv and read_parquet.
- dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects; those changes only apply to direct cudf usage.


Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9304
github-actions bot commented:

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr (Contributor) commented Jan 26, 2022:

@shridharathi @rjzamora @ayushdg is this PR still something that we're interested in?

rjzamora (Member) commented:

Most of this PR was moved into #9304 (and already merged). The only remaining feature here is the automatic generation of an Arrow-backed S3FileSystem (the current behavior is to simply wrap the default fsspec-based filesystem). My impression is that an automatic S3FileSystem is not something we want to maintain in cudf. I suggest we close this and leave it up to the user to pass an open file if the fsspec -> pyarrow approach does not suit their needs.
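Concretely, "pass an open file" means the pattern #9304 already enables; a sketch with a hypothetical bucket, mirroring the parquet example above:

```python
# What "pass an open file" looks like after #9304, for read_csv.
# Bucket/path are hypothetical.
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.csv")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_csv(fil)
```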

rjzamora closed this on Jan 31, 2022.