[Experimental] Optimize cudf/dask-cudf read_parquet for s3/remote filesystems #9225
Conversation
python/cudf/cudf/utils/ioutils.py
```python
# We have an fsspec filesystem and a path
with fs.open(path_or_fob, mode="rb", cache_type="none") as fob:
    fob.seek(offset)
    local_buffer[offset : offset + nbytes] = np.frombuffer(
        fob.read(nbytes), dtype="b",
    )
```
@leiterenato - Perhaps you can comment on the optimal API to read a specific set of bytes in gcs? Hopefully, we can add those optimizations directly to gcsfs so that a simple `fs.read_block(...)` call would be optimal here. Note that I am using seek/read for now, since `read_block` will actually open the file with read-ahead caching, and then call seek/read.
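For reference, a minimal sketch of the two fsspec access patterns mentioned above (the bucket/path, `offset`, and `nbytes` values are placeholders):

```python
import fsspec

fs = fsspec.filesystem("s3")  # assumes s3fs is installed and credentials are configured
path = "my-bucket/data.parquet"  # placeholder path
offset, nbytes = 4, 1024  # placeholder byte range

# read_block opens the file with fsspec's default read-ahead caching and then
# seeks/reads, so more than `nbytes` may be transferred over the network:
block = fs.read_block(path, offset, nbytes)

# Opening with cache_type="none" disables read-ahead, so only the requested
# bytes are fetched:
with fs.open(path, mode="rb", cache_type="none") as f:
    f.seek(offset)
    block = f.read(nbytes)
```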
@rjzamora Currently the tool `gcloud alpha storage cp` has the most optimized implementation (download link). The source code is in this directory: `lib/googlecloudsdk/command_lib/storage/tasks/cp/`. There is also a blog post with more information.
gcsfs supports reading a single block using `cat`, possibly for multiple blocks in multiple files concurrently. We just need to push fsspec/filesystem_spec#744 over the line for it to be available in all async implementations. (The method is available for all backends, but, of course, it is not concurrent if the implementation is not async.)
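As an illustration, a single byte range can already be fetched through the generic fsspec `cat_file` interface without creating a buffered file object (the path and offsets are placeholders; concurrent multi-range fetching is what fsspec#744 targets for async backends):

```python
import fsspec

fs = fsspec.filesystem("gcs")  # or "s3"; assumes gcsfs/s3fs is installed
path = "my-bucket/data.parquet"  # placeholder path
offset, nbytes = 4, 1024  # placeholder byte range

# Fetch just one byte range in a single request:
block = fs.cat_file(path, start=offset, end=offset + nbytes)
```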
Question here: does the libcudf parquet reader do byte-range optimization logic (based on the parquet metadata: which columns/row groups to read, etc.) equivalent to what you implemented here in Python in the "Fsspec Data-transfer Optimization Code"?
I'm not completely sure what optimizations libcudf uses for data access, but it will certainly use partial IO and should try to minimize how much data is read from disk (cc @devavret and @vuule in case they have input here). It seems like you are implying that the Arrow-NativeFile approach should have similar (if not better) performance than fsspec if the backend is using the NativeFile efficiently. If so, I agree with you :)
libcudf has options to specify columns and row groups, and only reads the ones selected.
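For context, the same selection is exposed through the Python `read_parquet` API (path, column names, and row-group indices below are placeholders):

```python
import cudf

# Only the requested columns and row groups are read/decoded:
gdf = cudf.read_parquet(
    "data.parquet",       # placeholder path
    columns=["x", "y"],   # placeholder column names
    row_groups=[0, 2],    # placeholder row-group indices
)
```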
Yes, and so my question was to try to understand what would be the cause (and thus where potential improvements could be made): are there things we can improve in the FileSystem/RandomAccessFile interface (in Arrow), or is it because the Parquet reader in libcudf can do more optimizations in what it asks from the file? (E.g. it might not do all the optimizations you have now implemented in Python for fsspec, and then it's not necessarily the filesystem interface (fsspec or Arrow) that's the cause for a difference in performance.)
Just to point out that "only reads the ones selected" can be a bit ambiguous (not knowing the library): only deserializing the requested columns/row groups from parquet into libcudf data structures vs. actually only downloading the subset of bytes of the file that are needed to deserialize those columns/row groups.
Apologies, I see why that would be ambiguous. What I meant was that libcudf does not depend on the entire file's contents being available to it. We have a datasource class that can be used to make your own classes that implement a custom read interface.
This PR strips the pyarrow-NativeFile component out of #9225 (since those changes are not yet stable). I feel that it is reasonable to start by merging these fsspec-specific optimizations for 21.10, because they are stable and already result in a significant performance boost over the existing approach to remote storage. I still think it is very important that we eventually plumb NativeFile support into Python (cudf and dask_cudf), but we will likely need to target 21.12 for that improvement.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #9265
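A hedged usage sketch of what the merged fsspec optimization targets (the bucket, file, and column names are placeholders): a remote, column-selecting read should only pull the required byte ranges into host memory before handing them to libcudf.

```python
import cudf

# Reading a remote parquet file; with the optimized fsspec transfer, only the
# byte ranges needed for the selected columns should be copied to host memory.
gdf = cudf.read_parquet(
    "s3://my-bucket/data.parquet",  # placeholder URI
    columns=["x", "y"],             # placeholder column names
)
```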
Codecov Report

```
@@            Coverage Diff             @@
##       branch-21.12     #9225      +/-  ##
============================================
+ Coverage      10.79%    10.83%   +0.04%
============================================
  Files            116       116
  Lines          18869     19260     +391
============================================
+ Hits            2036      2087      +51
- Misses         16833     17173     +340
```

Continue to review the full report at Codecov.
@devavret thanks for the clarification. One other aspect to point out is that the optimizations @rjzamora added here (moved to #9265 now, I think) go further than just reading the required bytes. For example, they will also merge small/adjacent ranges to decrease the number of requests.
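An illustrative sketch of that kind of range coalescing (not the actual cudf/fsspec implementation; `max_gap` is a made-up threshold):

```python
def merge_ranges(ranges, max_gap=64_000):
    """Merge (offset, length) byte ranges whose gaps are smaller than max_gap,
    so that fewer (larger) remote requests are issued."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            # Close enough to the previous range: extend it to cover this one.
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, (offset + length) - prev_offset)
        else:
            merged.append((offset, length))
    return merged

print(merge_ranges([(0, 100), (150, 100), (1_000_000, 10)]))
# -> [(0, 250), (1000000, 10)]
```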
We already have that (see cudf `cpp/src/io/parquet/reader_impl.cu`, line 922 at d069d7e). Although we have plans to move this logic out of format-specific readers and into a general reader class that will look for these optimizations.
OK, good to know. In that case, it would actually be interesting to understand where the performance difference between libcudf and fsspec comes from (since both should be doing similar optimizations then).
It's actually something that fsspec would appreciate! Could it be upstreamed?
One reason is that the fsspec optimization added in #9265 only deals with reading from the file into host memory, whereas the optimization in libcudf primarily coalesces transfers from host memory to GPU memory. The fsspec optimization only kicks in when the file is not local, in which case it reads the data into host memory and passes it to libcudf; in that case, the benefits from #9265 and libcudf are additive. In the case of a local filesystem, libcudf effectively does both reads (disk -> host, host -> device), and the coalescing is the same for both transfers.
…csv in cudf (#9304)

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

**What this PR DOES do**:

- Enables users to pass Arrow-based file objects directly to the cudf `read_parquet` and `read_csv` functions. For example:

```python
import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)
```

- Adds automatic conversion of fsspec `AbstractBufferedFile` objects into Arrow-backed `PythonFile` objects. For `read_parquet`, an Arrow-backed `PythonFile` object can be used (in place of an optimized fsspec transfer) by passing `use_python_file_object=True`:

```python
import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)
```

or

```python
import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)
```

**What this PR does NOT do**:

- cudf will **not** automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
- cudf will **not** accept NativeFile input for IO functions other than read_csv and read_parquet.
- dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects.
- These changes only apply to direct cudf usage.

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9304
Background

The current versions of `cudf.read_parquet` and `dask_cudf.read_parquet` are poorly optimized for remote storage (e.g. s3 and gcs). As discussed in #7475, this is because cudf and dask-cudf use fsspec as a universal file-system adapter, and libcudf's parquet logic cannot call into a python-based fsspec file-system object to seek/read specific byte ranges from the remote file. For this reason, cudf/dask-cudf currently call `read` to copy all contents of each parquet file into a local memory buffer before passing that buffer to libcudf. This is okay if the process calling `read_parquet` intends to read the entire file, and the file is reasonably small. However, there are other common cases:

- If only a subset of columns and/or row groups is needed by the `read_parquet` call, then it is clearly inefficient to copy the entire file to host memory.
- If the file is large, it may also be inefficient to transfer its entire contents in a single `read` operation.

Changes in This PR
This PR builds upon the cpp/cython changes in #8961 to enable the creation and/or processing of Arrow `NativeFile` objects in `cudf.read_parquet`. Since Arrow-backed FileSystem definitions do not exist for all remote file-systems (e.g. GCS), this PR also optimizes the transfer of data from remote storage to host memory with fsspec (taking advantage of the same optimization targeted in fsspec#744). More specifically, we use a local "dummy buffer", and avoid transferring any data that is not actually required by the underlying libcudf parquet read. We also use a concurrent read operation to transfer the bytes of the file in parallel.

Although the fsspec optimization was originally intended as a "temporary" solution for GCS, it is actually more performant (and stable) than the Arrow-`NativeFile` approach. The only disadvantage of using fsspec is that we still need enough host memory to store the entire parquet file (even if we do not actually populate most of the buffer with remote data).
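A minimal sketch of the "dummy buffer" idea described above, assuming the needed `(offset, nbytes)` ranges have already been determined from the parquet metadata (the helper name and range list are illustrative, not the cudf implementation):

```python
import numpy as np
from fsspec.core import get_fs_token_paths

def read_ranges_into_dummy_buffer(path, byte_ranges):
    """Copy only the requested (offset, nbytes) ranges of a remote file into a
    local buffer sized to the whole file; untouched regions stay zero-filled."""
    fs = get_fs_token_paths(path)[0]
    file_size = fs.size(path)
    local_buffer = np.zeros(file_size, dtype="b")  # the "dummy buffer"
    with fs.open(path, mode="rb", cache_type="none") as fob:
        for offset, nbytes in byte_ranges:
            fob.seek(offset)
            local_buffer[offset : offset + nbytes] = np.frombuffer(
                fob.read(nbytes), dtype="b"
            )
    return local_buffer.tobytes()
```

The resulting buffer can then be wrapped in something like `io.BytesIO` and handed to `cudf.read_parquet`, without the unneeded regions ever being transferred from remote storage.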
Experimental API

We introduce the `arrow_filesystem=` option to both the cudf and dask-cudf `read_parquet` APIs. This argument is a boolean, with a default value of `False`. It determines whether a url-based path input should be used to infer an Arrow-based filesystem object. If url-based file-system inference fails, both cudf and dask-cudf will fall back to fsspec for file-system handling. We also introduce a `legacy_transfer=` option (default `False`) to allow the user to avoid the optimized data-transfer logic in the case that fsspec is used.

Default APIs
```python
df = cudf.read_parquet(<path-or-handle>, arrow_filesystem=False, legacy_transfer=False)
ddf = dask_cudf.read_parquet(<path-or-handle>, arrow_filesystem=False, legacy_transfer=False)
```
API Notes

- `arrow_filesystem`: Setting this value to `True` tells cudf/dask-cudf to try to infer an arrow-based filesystem object that will enable random access in libcudf. When the underlying parquet file can be opened as an arrow `NativeFile`, cudf no longer needs to copy the data into a local host buffer before calling down to libcudf (because libcudf can seek/read from the arrow file object directly). `arrow_filesystem=True` currently works in most cases, but does not perform as well as the optimized fsspec data transfer, and still fails in some cases (especially in Dask).
- `legacy_transfer`: Setting this value to `True` will avoid the new fsspec data-transfer optimization. The option is only included for debugging and comparison, and may be removed before this PR is ready to merge.
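For illustration, a hedged usage sketch of the experimental options described above (the URI is a placeholder; both keywords are the options proposed in this PR):

```python
import cudf

# Try to infer an Arrow-based filesystem for random access in libcudf
# (falls back to fsspec if inference fails):
df = cudf.read_parquet("s3://my-bucket/data.parquet", arrow_filesystem=True)

# Stay on fsspec, but skip the new data-transfer optimization
# (useful for debugging/comparison):
df = cudf.read_parquet("s3://my-bucket/data.parquet", legacy_transfer=True)
```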