-
Notifications
You must be signed in to change notification settings - Fork 953
Add owning types to hold Arrow data #18084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…sting private data
Since there is alot of memory juggling here, could you run a compute-sanitizer check on these tests? |
I have not done that yet, and yes I can. |
Compute-sanitizer looks good:
|
Actually it would be best to run this with the cuda memory resource otherwise the pool memory resource could hide overwrites/reads.
|
Good call, I forgot about that. Also looks clean with that, no errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of small questions, but overall it looks good to me! Thank you for detailed explanations on the ownership and release logic while shallow copying; it would be nice to preserve some portion of it (https://github.com/rapidsai/cudf/pull/18084/files#r1978708382, https://github.com/rapidsai/cudf/pull/18084/files#r1974287846 for instance) as comments in the code.
.array = {}, | ||
.device_id = 0, | ||
.device_type = ARROW_DEVICE_CUDA, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it okay to (already) use a C++20 language feature?
Edit: is it only okay in .cpp files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a great question. I wasn't even thinking about this. Designated initializers are a C99 feature, so my guess is that C++ compilers have been supporting this as an extension since well before C++20 added this feature.
I could remove it if you feel strongly, but I've seen multiple other places in our code bases use this already so I don't think it's worth scrubbing. We'll move to C++20 soon enough (rapidsai/build-planning#113).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the compiler does not complain, neither will I :) I'm mostly surprised this compiles with the current version(s).
{ | ||
auto private_data = static_cast<VectorOfArrays*>(stream->private_data); | ||
|
||
[[maybe_unused]] auto rc = ArrowSchemaDeepCopy(private_data->schema.get(), out_schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume I missed a conversation :)
Why don't we check/return the error code here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find. This isn't actually new code, just moved from from_arrow_stream_test.cpp
, but now is a good time to clean this up. We should be guaranteed that this won't throw by construction, but no reason not to do the right thing and check here.
This is a great idea. I'll distill some of the discussion on this PR into comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved trivial CMake changes
/merge |
3bb9881
into
rapidsai:branch-25.04
The new types introduced in #18084 use the preexisting `to_arrow_device*` functions to produce views. Due to not every cudf type mapping perfectly to an Arrow type (historically decimals had misalignment until recent versions of arrow, and we still store boolean columns as-is rather than as bit columns like arrow) there are cases where that conversion requires allocating new memory. Previously there was no way to cache that conversion. Now, we can store the intermediate in the new types to avoid needing to reallocate on repeated calls. This change also synchronizes the APIs of the corresponding vanilla cudf column/table types. To improve that synchronization, the view creation happens upon creation of these types, allowing us to drop the stream and mr parameters from the view methods. For readability, I also aligned the ordering of the declarations in the header and the definitions in the files. I have not benchmarked the impact of these changes yet since we are not using these APIs anywhere significant yet. I plan to add in benchmarks as part of the PRs to leverage the new types from Python, at which point I plan to optimize as needed.
This PR leverages #18084 to rework the Python layer of Arrow interchange. With this change, we can now expose [the Arrow capsule interfaces](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for pylibcudf Columns and Tables. This PR also paves the way for exposing the device capsules, which will allow us to provide zero-copy Arrow views into pylibcudf objects. To get everything working, this PR also makes some ancillary changes: - These changes uncovered a number of places where the libcudf arrow interop code was not properly handling the NA type or 0 row columns and tables. Those cases have been fixed. - The code added in #18314 to support constructing pylibcudf Columns from a combination of a libcudf column_view and an arbitrary owner (as opposed to a Column owner) was incomplete. It worked in that PR because we don't actually do anything with Columns produced by the one use case tested there other than store the data then quickly unpack it (this was for packed columns). Using that code path with arrow columns uncovered a much bigger gap. The core issue is that gpumemoryview is constructed assuming that every object that it wraps has a CUDA Array interface. The new factory added in #18314 bypassed that, resulting in a gpumemoryview that was effectively in an invalid state for most operations. To fix this, I replaced the existing approach with a requirement that we wrap the existing owning object in something exposing a CAI before constructing the gpumemoryview. - To validate that these changes did not regress performance, I ran the Python benchmarks. In the process I added an additional `from_arrow` benchmark. - While running the benchmarks, I noticed an issue with our usage of pytest-benchmark due to dependency issues. I added a pinning in our repo for now and upstreamed fixes in conda-forge/conda-forge-repodata-patches-feedstock#990 and conda-forge/pytest-benchmark-feedstock#27. - I also addressed some of the outstanding comments from #18302 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Shruti Shivakumar (https://github.com/shrshi) - Matthew Murray (https://github.com/Matt711) - Tianyu Liu (https://github.com/kingcrimsontianyu) URL: #18402
Description
This PR introduces two new types to cudf,
arrow_column
andarrow_table
. These types are analogous tocudf::column
andcudf::table
, but instead of using cudf's unique ownership semantics these follow a shared ownership model that is more amenable for use with Arrow interop. These types are intended to be used in place of direct calls to Arrow interop functions likefrom_arrow_device
, which place the onus on the caller to track which APIs handle Arrow C Data Interface memory management semantics for you and which ones do not (see discussion in this thread for lots of examples). With the new types, the semantics are fairly straightforward and map to what one would expect when using the C Data Interface: the cudf objects either copy (in the case of host data) or move (in the case of device data) the input array structures and leave them in a released state afterwards.To keep the scope of this PR limited, I have implemented the core logic by simply calling the existing interop functions in suitable ways and only adding new logic as needed to handle proper ownership management. If we are happy with the new model, over time we can move to deprecate those code paths and move more of the logic directly into these classes.
Closes #16104
Checklist