Skip to content

Add owning types to hold Arrow data #18084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 59 commits into from
Mar 17, 2025

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Feb 25, 2025

Description

This PR introduces two new types to cudf, arrow_column and arrow_table. These types are analogous to cudf::column and cudf::table, but instead of using cudf's unique ownership semantics these follow a shared ownership model that is more amenable for use with Arrow interop. These types are intended to be used in place of direct calls to Arrow interop functions like from_arrow_device, which place the onus on the caller to track which APIs handle Arrow C Data Interface memory management semantics for you and which ones do not (see discussion in this thread for lots of examples). With the new types, the semantics are fairly straightforward and map to what one would expect when using the C Data Interface: the cudf objects either copy (in the case of host data) or move (in the case of device data) the input array structures and leave them in a released state afterwards.

To keep the scope of this PR limited, I have implemented the core logic by simply calling the existing interop functions in suitable ways and only adding new logic as needed to handle proper ownership management. If we are happy with the new model, over time we can move to deprecate those code paths and move more of the logic directly into these classes.

Closes #16104

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt
Copy link
Contributor

Since there is alot of memory juggling here, could you run a compute-sanitizer check on these tests?
Sorry if this has already been asked.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 12, 2025

I have not done that yet, and yes I can.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 13, 2025

Compute-sanitizer looks good:

$ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST
...
========= ERROR SUMMARY: 0 errors

@davidwendt
Copy link
Contributor

Compute-sanitizer looks good:

$ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST
...
========= ERROR SUMMARY: 0 errors

Actually it would be best to run this with the cuda memory resource otherwise the pool memory resource could hide overwrites/reads.

$ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST --rmm_mode=cuda

@vyasr
Copy link
Contributor Author

vyasr commented Mar 14, 2025

Good call, I forgot about that. Also looks clean with that, no errors.

Copy link
Contributor

@shrshi shrshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small questions, but overall it looks good to me! Thank you for detailed explanations on the ownership and release logic while shallow copying; it would be nice to preserve some portion of it (https://github.com/rapidsai/cudf/pull/18084/files#r1978708382, https://github.com/rapidsai/cudf/pull/18084/files#r1974287846 for instance) as comments in the code.

Comment on lines +208 to +210
.array = {},
.device_id = 0,
.device_type = ARROW_DEVICE_CUDA,
Copy link
Contributor

@vuule vuule Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it okay to (already) use a C++20 language feature?
Edit: is it only okay in .cpp files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a great question. I wasn't even thinking about this. Designated initializers are a C99 feature, so my guess is that C++ compilers have been supporting this as an extension since well before C++20 added this feature.

I could remove it if you feel strongly, but I've seen multiple other places in our code bases use this already so I don't think it's worth scrubbing. We'll move to C++20 soon enough (rapidsai/build-planning#113).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the compiler does not complain, neither will I :) I'm mostly surprised this compiles with the current version(s).

{
auto private_data = static_cast<VectorOfArrays*>(stream->private_data);

[[maybe_unused]] auto rc = ArrowSchemaDeepCopy(private_data->schema.get(), out_schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume I missed a conversation :)
Why don't we check/return the error code here?

Copy link
Contributor Author

@vyasr vyasr Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find. This isn't actually new code, just moved from from_arrow_stream_test.cpp, but now is a good time to clean this up. We should be guaranteed that this won't throw by construction, but no reason not to do the right thing and check here.

@vyasr
Copy link
Contributor Author

vyasr commented Mar 14, 2025

A couple of small questions, but overall it looks good to me! Thank you for detailed explanations on the ownership and release logic while shallow copying; it would be nice to preserve some portion of it (https://github.com/rapidsai/cudf/pull/18084/files#r1978708382, https://github.com/rapidsai/cudf/pull/18084/files#r1974287846 for instance) as comments in the code.

This is a great idea. I'll distill some of the discussion on this PR into comments.

@vyasr vyasr requested a review from vuule March 17, 2025 19:16
Copy link
Member

@KyleFromNVIDIA KyleFromNVIDIA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved trivial CMake changes

@vyasr
Copy link
Contributor Author

vyasr commented Mar 17, 2025

/merge

@rapids-bot rapids-bot bot merged commit 3bb9881 into rapidsai:branch-25.04 Mar 17, 2025
108 of 109 checks passed
@vyasr vyasr deleted the feat/arrow_column branch March 17, 2025 21:07
raydouglass pushed a commit that referenced this pull request Mar 20, 2025
The new types introduced in #18084 use the preexisting
`to_arrow_device*` functions to produce views. Due to not every cudf
type mapping perfectly to an Arrow type (historically decimals had
misalignment until recent versions of arrow, and we still store boolean
columns as-is rather than as bit columns like arrow) there are cases
where that conversion requires allocating new memory. Previously there
was no way to cache that conversion. Now, we can store the intermediate
in the new types to avoid needing to reallocate on repeated calls. This
change also synchronizes the APIs of the corresponding vanilla cudf
column/table types. To improve that synchronization, the view creation
happens upon creation of these types, allowing us to drop the stream and
mr parameters from the view methods. For readability, I also aligned the
ordering of the declarations in the header and the definitions in the
files.

I have not benchmarked the impact of these changes yet since we are not
using these APIs anywhere significant yet. I plan to add in benchmarks
as part of the PRs to leverage the new types from Python, at which point
I plan to optimize as needed.
rapids-bot bot pushed a commit that referenced this pull request Apr 16, 2025
This PR leverages #18084 to rework the Python layer of Arrow interchange. With this change, we can now expose [the Arrow capsule interfaces](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for pylibcudf Columns and Tables. This PR also paves the way for exposing the device capsules, which will allow us to provide zero-copy Arrow views into pylibcudf objects.

To get everything working, this PR also makes some ancillary changes:
- These changes uncovered a number of places where the libcudf arrow interop code was not properly handling the NA type or 0 row columns and tables. Those cases have been fixed.
- The code added in #18314 to support constructing pylibcudf Columns from a combination of a libcudf column_view and an arbitrary owner (as opposed to a Column owner) was incomplete. It worked in that PR because we don't actually do anything with Columns produced by the one use case tested there other than store the data then quickly unpack it (this was for packed columns). Using that code path with arrow columns uncovered a much bigger gap. The core issue is that gpumemoryview is constructed assuming that every object that it wraps has a CUDA Array interface. The new factory added in #18314 bypassed that, resulting in a gpumemoryview that was effectively in an invalid state for most operations. To fix this, I replaced the existing approach with a requirement that we wrap the existing owning object in something exposing a CAI before constructing the gpumemoryview.
- To validate that these changes did not regress performance, I ran the Python benchmarks. In the process I added an additional `from_arrow` benchmark.
- While running the benchmarks, I noticed an issue with our usage of pytest-benchmark due to dependency issues. I added a pinning in our repo for now and upstreamed fixes in conda-forge/conda-forge-repodata-patches-feedstock#990 and conda-forge/pytest-benchmark-feedstock#27.
- I also addressed some of the outstanding comments from #18302

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Shruti Shivakumar (https://github.com/shrshi)
  - Matthew Murray (https://github.com/Matt711)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

URL: #18402
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Introduce a new owning type for Arrow interop data
6 participants