
read_csv for s3 data #8961

Closed · wants to merge 15 commits

Conversation

shridharathi

These updates are meant to speed up reading CSV files from Amazon S3 buckets. The CSV data is passed down to the Cython bindings, where PyArrow functions open it and convert it from a NativeFile into a shared pointer to a RandomAccessFile, then into an Arrow IO source, and finally into a Datasource.
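The user-facing entry point for that chain looks roughly like the sketch below; the bucket and file names are hypothetical, and the Cython/C++ conversion steps happen inside cudf and are not shown.

```python
# Minimal sketch of the pyarrow side of the chain described above.
# Bucket/key are placeholders. fs.open_input_file() returns a pyarrow
# NativeFile, which is the object cudf's Cython layer converts through
# arrow::io::RandomAccessFile down to a cudf Datasource.
import pyarrow.fs as pa_fs

fs = pa_fs.S3FileSystem()
native_file = fs.open_input_file("my-bucket/some-file.csv")
```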

github-actions bot added labels `Python` (Affects Python cuDF API) and `libcudf` (Affects libcudf (C++/CUDA) code) on Aug 4, 2021.
shridharathi added the `2 - In Progress` (Currently a work in progress) label on Aug 5, 2021.
Comment on lines 1162 to 1164 (Member):

```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```

This logic might need to be moved to csv.py for now, to ensure this codepath is only followed for read_csv calls. In the future, this could be implemented here when other readers support this option as well.

Comment on lines 1164 to 1167 (Member):

```python
elif _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    path_or_data = fs.open_input_file(paths[0])
```

Can remove this, since we don't want to go down this codepath for anything that's not read_csv, and this utility is called by other readers as well.

Comment on lines 79 to 81 (Member):

```python
if _is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```

Can directly use the method defined in ioutils.
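A hedged sketch of what that suggestion might look like in csv.py; the `cudf.utils.ioutils` import path is an assumption based on cudf's layout, not code from this PR:

```python
# Illustrative only: reuse the _is_s3_filesystem helper from ioutils
# instead of redefining the check locally in csv.py. Import paths are
# assumed; fs and paths come from the surrounding reader function.
import pyarrow.fs

from cudf.utils import ioutils

if ioutils._is_s3_filesystem(fs):
    fs = pyarrow.fs.S3FileSystem()
    filepath_or_buffer = fs.open_input_file(paths[0])
```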

github-actions bot added labels `CMake` (CMake build issue), `conda`, and `Java` (Affects Java cuDF API) on Aug 28, 2021.
shridharathi marked this pull request as ready for review on August 30, 2021.
shridharathi requested review from a team as code owners on August 30, 2021.
github-actions bot removed labels `Java` (Affects Java cuDF API), `CMake` (CMake build issue), and `gpuCI` on Aug 31, 2021.
Comment on lines +120 to +121 (Member):

```python
if isinstance(datasource, NativeFile):
    datasource = NativeFileDatasource(datasource)
```

Can we move this into the make_source_info utility that is called on the next line? It seems that both csv and parquet hit that same function to set up the source information, so doing something like the following in that function should take care of both formats for us:

```python
...
    elif isinstance(src[0], (Datasource, NativeFile)):
        csrc = NativeFileDatasource(src[0]) if isinstance(src[0], NativeFile) else src[0]
        return source_info(csrc.get_datasource())
...
```

rjzamora (Member) commented Sep 8, 2021:

Thanks for working on this @shridharathi - this is really nice to see!

Just some thoughts (feel free to ignore):

It makes sense to automatically convert to a pyarrow S3FileSystem. However, given that it might be tricky to preserve all the possible storage_options used to generate the original fsspec filesystem object, it may also make sense to allow the user to pass in an explicit filesystem object to use and/or to pass in a NativeFile object directly. This way, we make it the user's responsibility to define the filesystem object correctly.
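A sketch of that first option from the user's side; the bucket, key, and region are hypothetical, and read_csv accepting a NativeFile assumes the behavior this PR adds:

```python
# Illustrative: the user constructs the pyarrow filesystem explicitly (with
# whatever credentials/region options they need), so cudf never has to
# translate storage_options. Bucket/key/region are hypothetical.
import cudf
import pyarrow.fs as pa_fs

fs = pa_fs.S3FileSystem(region="us-west-2")
with fs.open_input_file("my-bucket/some-file.csv") as fil:
    df = cudf.read_csv(fil)
```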

The other option may be to perform the fsspec->arrow filesystem conversion within the get_filepath_or_buffer utility. At that point, we can inspect the storage_options specified by the user, and warn the user if a pyarrow-backed filesystem couldn't be generated.
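One way the second option could look, using pyarrow's fsspec wrapper; `PyFileSystem` and `FSSpecHandler` are real pyarrow.fs APIs, but the helper name and fallback behavior below are assumptions, not code from this PR:

```python
# Hypothetical helper for the fsspec -> arrow conversion idea: wrap the
# fsspec filesystem (built from the user's storage_options) in a pyarrow
# filesystem, warning and falling back to fsspec if that fails.
import warnings

import pyarrow.fs as pa_fs


def _to_arrow_filesystem(fsspec_fs):
    try:
        return pa_fs.PyFileSystem(pa_fs.FSSpecHandler(fsspec_fs))
    except Exception as err:
        warnings.warn(
            f"Could not generate a pyarrow-backed filesystem ({err}); "
            "falling back to fsspec."
        )
        return None
```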

rapids-bot pushed a commit that referenced this pull request on Oct 4, 2021:

…csv in cudf (#9304)

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

**What this PR DOES do**:

- Enables users to pass Arrow-based file objects directly to the cudf `read_parquet` and `read_csv` functions. For example:

```python
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)
```

- Adds automatic conversion of fsspec `AbstractBufferedFile` objects into Arrow-backed `PythonFile` objects. For `read_parquet`, an Arrow-backed `PythonFile` object can be used (in place of an optimized fsspec transfer) by passing `use_python_file_object=True`:

```python
import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)
```

or 

```python
import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)
```


**What this PR does NOT do**:

- cudf will **not** automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
- cudf will **not** accept NativeFile input for IO functions other than read_csv and read_parquet.
- dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects; those changes only apply to direct cudf usage.


Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9304
github-actions bot commented:

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

vyasr (Contributor) commented Jan 26, 2022:

@shridharathi @rjzamora @ayushdg is this PR still something that we're interested in?

rjzamora (Member) commented:

Most of this PR was moved into #9304 (and already merged). The only remaining feature here is the automatic generation of an Arrow-backed S3FileSystem (the current behavior is to simply wrap the default fsspec-based filesystem). My impression is that an automatic S3FileSystem is not something we want to maintain in cudf. I suggest we close this and leave it up to the user to pass an open file if the fsspec -> pyarrow approach does not suit their needs.
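Concretely, "pass an open file" means the pattern #9304 already enables; a sketch with a hypothetical bucket, mirroring the parquet example above:

```python
# What "pass an open file" looks like after #9304, for read_csv.
# Bucket/path are hypothetical.
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.csv")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_csv(fil)
```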

rjzamora closed this on Jan 31, 2022.