[FEA/Proposal] Use Arrow backed filesystem objects for reading remote files #7475
So I made a quick prototype for testing out the performance improvement that could be realized here where only a couple of rows from a "large" parquet file might be desired. It works something like this.
I saw quite significant performance improvements when reading a parquet file on S3 that was roughly 480 MB with 18 columns, reading only a single row from that file vs the entire thing. The prototype output is included below to show the speedup. Note the results are in microseconds and less is better.
@kkraus14 would you be opposed to me making a PR for this? I have a couple of questions first though to try and streamline the process.
I'd welcome a PR to iterate further!
What would this function do? Create a datasource from a uri?
Pretty much. This is just off the top of my head, but it shouldn't be much more complicated than that. In the full "datasource" I made in the example, the only real work was in its constructor, so I was thinking one more function should suffice.
Sounds great to me!
This issue has been labeled
Still a desired [FEA]. PR has activity. |
This issue has been labeled |
Still desired. Waiting on upstream arrow changes. |
Arrow offers an API that allows users to provide a URI definition for target files. This PR uses that API and creates a new `arrow_io_source` constructor that accepts this information from the user, then creates the appropriate FileSystem instance and configures it for access to that file. This closes: #7475

Authors:
- Jeremy Dyer (https://github.com/jdye64)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Robert Maynard (https://github.com/robertmaynard)
- Mike Wilson (https://github.com/hyperbolic2346)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #7709
There's some ongoing work by @shridharathi to expose the arrow filesystem objects for s3 paths in this PR: #8961. The approach is essentially to check whether the path is an s3 path and create an arrow filesystem object instead of going via the fsspec route. The PR currently implements this for csv only, but it would be extended more generally in ioutils for other file formats as well. There are some cases around translating fsspec-understood storage options to params that PyArrow can understand that are still being worked out. So I feel it's worth reopening this issue, since the python side of things is still a WIP.
This issue is now solved since it is possible to read only the necessary parts of remote files. We may have to revisit whether we can keep the solution as part of #15193, but either way this issue is either going to stay closed or we will need to come up with an alternative solution that doesn't use the
Is your feature request related to a problem? Please describe.
The current python reading pipeline reads the entire buffer when reading from non-local datasources like HDFS, s3, etc. This is suboptimal for cases where a user reads a subset of the data (a few columns, row-based filtering), as the entire buffer is still pulled from the datasource.
All remote reading currently goes via `fsspec`, which provides a python filesystem interface for reading remote sources. Since the python filesystem object is not backed by a cpp object, the raw buffer needs to be read and passed down to libcudf. This can hamper performance, especially in network-bound environments when reading from the cloud.
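To make the cost concrete, here is a stdlib-only toy illustration (not cudf code) of the difference between a pipeline forced to materialize the whole buffer and one that can seek to just the bytes it needs, e.g. a parquet footer:

```python
# Toy comparison: whole-buffer read vs. seek-based partial read.
# CountingFile is a hypothetical helper for this illustration only.
import io

class CountingFile(io.BytesIO):
    """BytesIO wrapper that records how many bytes were actually read."""
    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        out = super().read(size)
        self.bytes_read += len(out)
        return out

data = bytes(1_000_000)  # stand-in for a large remote object

# Whole-buffer path (what a non-seekable python pipeline is forced into):
whole = CountingFile(data)
whole.read()

# Seek-based path (what a cpp-backed random access file enables):
partial = CountingFile(data)
partial.seek(len(data) - 8)  # e.g. jump straight to a parquet footer
partial.read(8)

print(whole.bytes_read, partial.bytes_read)  # 1000000 8
```

Over a network, those unread bytes translate directly into transfer time saved.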
Describe the solution you'd like
To use Arrow FileSystem objects, wherever applicable. These objects are backed by arrow cpp objects (CRandomAccessFile) which can then be passed down to libcudf, which has the logic to read the buffers as required based on the reader options and the file format.
Implementation Idea
- Use `pyarrow.FileSystem` to create a file object wherever applicable in `ioutils.get_filepath_or_buffer`.
- Use `arrow_io_datasource` to create an `arrow_datasource` that can be directly consumed by libcudf.
Describe alternatives you've considered
Additional context
Some initial discussion on the topic
#2760 (comment)
#2760 (comment)
cc: @randerzander @vuule @rjzamora