[FEA] Introduce a new owning type for Arrow interop data #16104

Closed
@vyasr

Description

Is your feature request related to a problem? Please describe.
Currently libcudf's primary data type is the column, a wrapper around a set of rmm::device_buffer objects that own its memory. The column has sole ownership of the data via the underlying unique_ptr semantics. From an algorithmic perspective this is fine because libcudf algorithms operate entirely on views of data. While column provides convenient conversions to column_view, users of libcudf could just as easily create views from any other data source, because column_view objects may be constructed from an arbitrary set of data pointers. cuDF Python leverages this, for instance: it handles ownership differently from libcudf and therefore extracts the data from column objects as soon as any libcudf algorithm returns one.

However, the ownership semantics of column are not flexible enough to accommodate the ingestion of Arrow device data via the C Data interface. Historically, libcudf has only supported consuming host Arrow data, which intrinsically requires a copy, so creating a column is fine. With the work done on #14926, though, libcudf now supports conversion from device Arrow data. Since the purpose of the Arrow interface is to support zero-copy sharing of data across library boundaries, it supports the hand-off of data that we may need to keep alive for an indeterminate amount of time but that we should eventually release. Critically, in this scenario "releasing" does not mean freeing the data. Rather, releasing is a process defined by the producer of the data that may free it, or may simply perform some other bookkeeping to allow it to be freed later (e.g. if there are shared memory semantics involved). There is no good way to represent these semantics with column right now, resulting in the more complex considerations laid out in this comment.
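
To make the release-versus-free distinction concrete, here is a minimal sketch of a producer whose release callback only drops a reference to shared data. The ArrowArray struct is copied from the Arrow C Data Interface specification so the example is self-contained. Everything else (shared_payload, export_state, export_shared, release_export) is a hypothetical name invented for illustration, and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Struct layout from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Hypothetical producer-side data that several exports may share.
struct shared_payload {
  std::vector<int32_t> values;
};

// Hypothetical per-export bookkeeping stored in ArrowArray::private_data.
struct export_state {
  std::shared_ptr<shared_payload> payload;
  const void* buffers[2];  // [0] validity bitmap (absent here), [1] values
};

// Releasing drops this export's reference. The payload itself is freed only
// when the last reference goes away, which may be much later.
void release_export(ArrowArray* array) {
  assert(array->release != nullptr);
  delete static_cast<export_state*>(array->private_data);
  array->release = nullptr;  // the specification requires marking the array released
}

// Export a shared payload as an ArrowArray without copying the values.
ArrowArray export_shared(std::shared_ptr<shared_payload> payload) {
  auto* state = new export_state{std::move(payload), {nullptr, nullptr}};
  state->buffers[1] = state->payload->values.data();

  ArrowArray out{};
  out.length       = static_cast<int64_t>(state->payload->values.size());
  out.n_buffers    = 2;
  out.buffers      = state->buffers;
  out.release      = &release_export;
  out.private_data = state;
  return out;
}
```

A consumer that calls release on such an array has done everything it is obliged to do, yet the producer's data may still be alive. A column that frees its buffers on destruction cannot express this hand-back.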

Describe the solution you'd like
We should define a new type cudf::arrow_column that can faithfully represent the Arrow interface's memory semantics. This type could be used for both host and device Arrow data (i.e. by both from_arrow_host and from_arrow_device) and would be responsible for storing an Arrow[Device]Array and then calling its release callback upon destruction. If necessary, this type could itself expose a mechanism by which its ArrowArray could be exported, i.e. it could become a producer for the C Data interface. To do so, it would need to wrap its own ArrowArray in an internal object with something like reference counting semantics so that the original producer's release callback would not be invoked until all re-exported arrays were also destroyed. This approach would allow us to unify all of the existing APIs into a simpler set of function overloads with well-defined memory semantics that the caller no longer has to be aware of.
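
A rough sketch of what such a type might look like follows. It is not a proposed implementation: the sketch namespace, the holder struct, and the array() and export_to() members are hypothetical names, only the host ArrowArray struct (copied from the Arrow C Data Interface specification) is handled, and it assumes a flat array with no children or dictionary. A std::shared_ptr supplies the reference counting that delays the producer's release callback until every re-export is released.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Struct layout from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

namespace sketch {  // hypothetical namespace, not part of cudf

class arrow_column {
 public:
  // Takes ownership by moving the struct out of `input`, as the
  // specification describes: the source is marked released without
  // invoking the producer's callback.
  explicit arrow_column(ArrowArray* input) : owner_{std::make_shared<holder>(input)} {}

  arrow_column(arrow_column&&) noexcept            = default;
  arrow_column& operator=(arrow_column&&) noexcept = default;
  arrow_column(arrow_column const&)                = delete;
  arrow_column& operator=(arrow_column const&)     = delete;

  // Non-owning access to the wrapped array.
  ArrowArray const& array() const { return owner_->array; }

  // Re-export: `out` shares the buffers and holds its own reference to the
  // holder, so the producer's release runs only after the last reference dies.
  void export_to(ArrowArray* out) const {
    // Assumption of this sketch: flat arrays only.
    assert(owner_->array.n_children == 0 && owner_->array.dictionary == nullptr);
    *out              = owner_->array;  // shallow copy of metadata and buffer pointers
    out->private_data = new std::shared_ptr<holder>(owner_);
    out->release      = &release_export;
  }

 private:
  struct holder {
    ArrowArray array;
    explicit holder(ArrowArray* input) : array{*input} { input->release = nullptr; }
    holder(holder const&)            = delete;
    holder& operator=(holder const&) = delete;
    // Invokes the original producer's callback exactly once.
    ~holder() {
      if (array.release != nullptr) { array.release(&array); }
    }
  };

  // Release callback installed on re-exported arrays: drops one reference.
  static void release_export(ArrowArray* out) {
    delete static_cast<std::shared_ptr<holder>*>(out->private_data);
    out->release = nullptr;
  }

  std::shared_ptr<holder> owner_;
};

}  // namespace sketch
```

With this shape, destroying the arrow_column while a re-exported array is still alive does not invoke the producer's callback. It fires when the last holder reference is dropped, whichever side drops it.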

Describe alternatives you've considered
We could alternatively try to find a way to support shared ownership at a deeper level by making it possible to construct a shared version of an rmm::device_buffer. This would require substantially more work, though, and might still require refactoring of cudf internals to use such an object. Simply making it possible to construct an rmm::device_buffer from a preexisting pointer in such a way that the buffer assumes ownership (analogous to the std::unique_ptr(pointer p) constructor) would not be sufficient, since what we need is a way for the buffer to call the release callback on destruction instead of freeing the memory. This seems out of scope for rmm and more like a cudf feature since it's specifically for Arrow interop.

Additional context
The extended discussion leading to this issue may be found here.

Labels: feature request, libcudf
