-
Notifications
You must be signed in to change notification settings - Fork 953
Use owning Arrow types in C++ to expose data to Python #18402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use owning Arrow types in C++ to expose data to Python #18402
Conversation
…and use that to fix invalid test
This reverts commit 2d7ea5de01d141a6650068c4c3a513e2bc67b3c8.
… use those to construct gpumemoryviews safely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Packaging and C++ look good. I skimmed Python/Cython changes too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of nits in the C++ code, but LGTM otherwise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just non-blocking suggestions, thanks!
@@ -0,0 +1,9 @@ | |||
# Copyright (c) 2025, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think were missing type stubs _interop_helpers.pyi
. Is that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because this file pretty much exclusively exposes pure Cython functions. There is nothing visible to Python except the ColumnMetadata type, but that is (for the moment) primarily publicly exposed via the interop module that already has the type stubs for that.
NANOARROW_RETURN_NOT_OK(set_contents(child_contents, tmp->children[i])); | ||
} else { | ||
NANOARROW_RETURN_NOT_OK(cudf::type_dispatcher( | ||
child->type(), dispatch_to_arrow_device{}, std::move(*child), stream, mr, child_ptr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit paranoid on this: Do we have a guarantee that the initialization param_2 = std::move(*child)
happens before param_0 = child->type()
? If not, we may rely on the compiler's unspecified behavior for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudf::type_dispatcher
passes std::move(*child)
with a forwarding reference, so the move has not happened yet when the function is entered, whereas param_0 = child->type()
is sequenced before the function is entered. So perhaps it is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we are safe because while we may not be guaranteed anything about the order in which the two parameters are evaluated, we are guaranteed that all of the parameters are evaluated before the function call begins. std::move
doesn't actually do anything, it's just a cast to an rvalue that then allows the object to be modified. Therefore, even if we pass an rvalue ref to *child
into the type_dispatcher
(which forwards it along to the dispatch_to_arrow_device
functor), we are guaranteed that param1 = child->type()
is evaluated before *child
is actually modified in any way that could evaluate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ part looks good to me. Left one small suggestion, and one small question to confirm.
…data_structures_python
…data_structures_python
/merge |
9a2ccdf
into
rapidsai:branch-25.06
I forgot to post these benchmarks earlier, so posting for posterity (tl;dr this PR did not affect performance): Before
After
|
Description
This PR leverages #18084 to rework the Python layer of Arrow interchange. With this change, we can now expose the Arrow capsule interfaces for pylibcudf Columns and Tables. This PR also paves the way for exposing the device capsules, which will allow us to provide zero-copy Arrow views into pylibcudf objects.
To get everything working, this PR also makes some ancillary changes:
from_arrow
benchmark.Checklist