Description
Is your feature request related to a problem? Please describe.
I would like to generate Arrow IPC payloads from a cudf::table
without copying the data off of the GPU device. Currently the to_arrow
and from_arrow
functions explicitly perform copies to and from the GPU device. There is not currently any efficient way to generate Arrow IPC payloads from libcudf without copying all of the data off of the device.
Describe the solution you'd like
In addition to the existing to_arrow
and from_arrow
functions, we could have a to_arrow_device_arr
function that populates an ArrowDeviceArray
struct from a cudf::table
or cudf::column
. We'd also create a from_arrow_device_arr
function that could construct a cudf::table
/ cudf::column
from an ArrowDeviceArray
that describes Arrow data which is already on the device. Once you have the ArrowDeviceArray
struct, the Arrow C++ library itself can be used to generate the IPC payloads without needing to copy the data off the device. This would also increase the interoperability options that libcudf has, as anything which produces or consumes ArrowDeviceArray
structs could hand data off to libcudf and vice versa.
Describe alternatives you've considered
An alternative would be to implement Arrow IPC creating inside of the libcudf library, but I saw that this was explicitly removed from libcudf due to the requirement of linking against libarrow_cuda.so
. (#10994). Implementing conversions to and from ArrowDeviceArray
wouldn't require linking against libarrow_cuda.so
at all and would provide an easy way to allow any consumers to create Arrow IPC payloads, or whatever else they want to do with the resulting Arrow data. Such as leveraging CUDA IPC with the data.
Additional context
When designing the ArrowDeviceArray
struct, I created https://github.com/zeroshade/arrow-non-cpu as a POC which used Python numba to generate and operate on some GPU data before handing it off to libcudf, and then getting it back without copying off the device. Using ArrowDeviceArray
as the way it handed the data off.
More recently I've been working on creating a protocol for sending Arrow IPC data that is located on GPUs across high-performance transports like UCX. To this end, I created a POC using libcudf to pass the data. As a result I have a partial implementation of the to_arrow_device_arr
which can be found here. There's likely better ways than what I'm doing in there, but at least for my POC it was working.
The contribution guidelines say I should file this issue first for discussion rather than just submitting a PR, so that's where I'm at. I plan on trying to create a full implementation that I can contribute but wanted to have this discussion and get feedback here first.
Thanks for hearing me out everyone!