read_csv for s3 data #8961
Conversation
python/cudf/cudf/utils/ioutils.py
Outdated
```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```
This logic might need to be moved to csv.py for now, to ensure this codepath is only followed for read_csv calls. In the future, this could be implemented here when other readers support this option as well.
python/cudf/cudf/utils/ioutils.py
Outdated
```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```
Can remove this, since we don't want to go down this codepath for anything that's not read_csv, and this utility is called by other readers as well.
python/cudf/cudf/io/csv.py
Outdated
```python
if _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```
Can directly use the method defined in ioutils rather than redefining the check here.
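A minimal sketch of that suggestion (the import path is an assumption based on the file shown in the diff above):

```python
# Sketch only: reuse the existing helper from ioutils instead of
# redefining the check in csv.py. The import path below is an
# assumption based on the file shown in the diff.
import pyarrow.fs
from cudf.utils.ioutils import _is_s3_filesystem

# fs and paths come from the surrounding read_csv setup, as in the
# diff hunk above.
if _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```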
```python
if isinstance(datasource, NativeFile):
    datasource = NativeFileDatasource(datasource)
```
Can we move this into the make_source_info utility that is called on the next line? It seems that both csv and parquet hit that same function to set up the source information, so doing something like the following in that function should take care of both formats for us:

```python
...
elif isinstance(src[0], (Datasource, NativeFile)):
    csrc = NativeFileDatasource(src[0]) if isinstance(src[0], NativeFile) else src[0]
    return source_info(csrc.get_datasource())
...
```
Thanks for working on this @shridharathi - This is really nice to see! Just some thoughts (feel free to ignore): It makes sense to automatically convert to a pyarrow … The other option may be to perform the fsspec->arrow filesystem conversion within the …
…csv in cudf (#9304)

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

**What this PR DOES do**:

- Enables users to pass Arrow-based file objects directly to the cudf `read_parquet` and `read_csv` functions. For example:

  ```python
  import cudf
  import pyarrow.fs as pa_fs

  fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
  with fs.open_input_file(path) as fil:
      gdf = cudf.read_parquet(fil)
  ```

- Adds automatic conversion of fsspec `AbstractBufferedFile` objects into Arrow-backed `PythonFile` objects. For `read_parquet`, an Arrow-backed `PythonFile` object can be used (in place of an optimized fsspec transfer) by passing `use_python_file_object=True`:

  ```python
  import cudf

  gdf = cudf.read_parquet(path, use_python_file_object=True)
  ```

  or

  ```python
  import cudf
  from fsspec.core import get_fs_token_paths

  fs = get_fs_token_paths(path)[0]
  with fs.open(path, mode="rb") as fil:
      gdf = cudf.read_parquet(fil, use_python_file_object=True)
  ```

**What this PR does NOT do**:

- cudf will **not** automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
- cudf will **not** accept NativeFile input for IO functions other than read_csv and read_parquet
- dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects
  - Those changes only apply to direct cudf usage

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9304
This PR has been labeled …
@shridharathi @rjzamora @ayushdg is this PR still something that we're interested in?
Most of this PR was moved into #9304 (and already merged). The only remaining feature here is the automatic generation of an arrow-backed …
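As a rough illustration of that remaining feature (a hypothetical sketch, not merged code): given a plain `s3://` path, cudf would create the direct Arrow NativeFile itself instead of requiring the caller to supply one. The helper name below is invented for illustration.

```python
# Hypothetical sketch of automatic "direct" NativeFile generation.
# _open_arrow_native_file is an invented name; cudf does not ship it.
import pyarrow.fs as pa_fs

def _open_arrow_native_file(path):
    # Resolve the filesystem (e.g. S3FileSystem for "s3://" paths)
    # and open the file as an Arrow NativeFile.
    fs, resolved_path = pa_fs.FileSystem.from_uri(path)
    return fs.open_input_file(resolved_path)
```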
These updates are meant to help read CSV files from Amazon S3 buckets faster. The CSV source is passed down to the Cython bindings, where PyArrow functions open it and convert it from a NativeFile into a shared pointer to a RandomAccessFile, then into an Arrow IO source, and finally into a Datasource.
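A minimal end-to-end sketch of the user-facing side of that flow (the module path `cudf._lib.io.datasource` is an assumption based on the cudf layout at the time of this PR):

```python
# Sketch of the conversion chain described above, from the Python side.
# Assumes cudf's internal NativeFileDatasource wrapper is importable
# from cudf._lib.io.datasource, as in the cudf tree of this era.
import pyarrow.fs as pa_fs
from cudf._lib.io.datasource import NativeFileDatasource

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/data.csv")
native_file = fs.open_input_file(path)  # pyarrow NativeFile

# NativeFileDatasource wraps the NativeFile's underlying
# RandomAccessFile in an Arrow IO source, which libcudf then
# consumes as a Datasource.
datasource = NativeFileDatasource(native_file)
```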