Add owning types to hold Arrow data #18084

vyasr · 2025-02-25T01:42:11Z

Description

This PR introduces two new types to cudf, arrow_column and arrow_table. These types are analogous to cudf::column and cudf::table, but instead of using cudf's unique ownership semantics these follow a shared ownership model that is more amenable for use with Arrow interop. These types are intended to be used in place of direct calls to Arrow interop functions like from_arrow_device, which place the onus on the caller to track which APIs handle Arrow C Data Interface memory management semantics for you and which ones do not (see discussion in this thread for lots of examples). With the new types, the semantics are fairly straightforward and map to what one would expect when using the C Data Interface: the cudf objects either copy (in the case of host data) or move (in the case of device data) the input array structures and leave them in a released state afterwards.
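To make the "move and leave in a released state" semantics concrete, here is a minimal sketch. It is not the code from this PR. The `ArrowArray` layout follows the Arrow C Data Interface specification; `take_ownership`, `toy_release`, `make_toy_array`, and `g_release_calls` are hypothetical names used only for illustration, and `std::shared_ptr` stands in for whatever shared-ownership mechanism the real types use.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Struct layout from the Arrow C Data Interface specification.
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Hypothetical helper: move `src` into a shared owner. The struct is copied
// bitwise and the source is marked released (release == nullptr), which is
// how the C Data Interface expresses a move.
inline std::shared_ptr<ArrowArray> take_ownership(ArrowArray* src)
{
  assert(src != nullptr && src->release != nullptr);
  auto deleter = [](ArrowArray* a) {
    // Call the producer's release callback once, when the last owner goes away.
    if (a->release != nullptr) { a->release(a); }
    delete a;
  };
  std::shared_ptr<ArrowArray> owner(new ArrowArray(*src), deleter);
  src->release = nullptr;  // the caller's struct is now in a released state
  return owner;
}

// Toy producer used to observe when release is called (illustration only).
static int g_release_calls = 0;
inline void toy_release(ArrowArray* a)
{
  ++g_release_calls;
  a->release = nullptr;
}

inline ArrowArray make_toy_array(int64_t length)
{
  ArrowArray a{};  // zero-initialize every field
  a.length  = length;
  a.release = toy_release;
  return a;
}
```

In this sketch, any number of copies of the returned `std::shared_ptr` can coexist, and the producer's release callback runs exactly once when the last copy is destroyed.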

To keep the scope of this PR limited, I have implemented the core logic by simply calling the existing interop functions in suitable ways and only adding new logic as needed to handle proper ownership management. If we are happy with the new model, over time we can move to deprecate those code paths and move more of the logic directly into these classes.

Closes #16104

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

davidwendt · 2025-03-10T14:20:42Z

Since there is a lot of memory juggling here, could you run a compute-sanitizer check on these tests?
Sorry if this has already been asked.

vyasr · 2025-03-12T21:39:49Z

I have not done that yet, and yes I can.

vyasr · 2025-03-13T23:13:19Z

Compute-sanitizer looks good:

$ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST
...
========= ERROR SUMMARY: 0 errors

davidwendt · 2025-03-14T12:30:12Z

> Compute-sanitizer looks good:
>
> $ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST
> ...
> ========= ERROR SUMMARY: 0 errors

Actually, it would be best to run this with the CUDA memory resource; otherwise the pool memory resource could hide overwrites/reads.

$ compute-sanitizer --tool memcheck ./cpp/build/latest/INTEROP_TEST --rmm_mode=cuda
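
A toy model of why a pool can hide overruns, assuming a checker that only knows about top-level allocations. This is a simplified sketch, not how compute-sanitizer or RMM are implemented; `toy_checker`, `direct_allocator`, and `pool_allocator` are hypothetical names, and addresses are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy checker: an access is "valid" if it falls inside any registered range.
struct toy_checker {
  struct range {
    std::size_t begin;
    std::size_t end;
  };
  std::vector<range> live;

  bool is_valid(std::size_t addr) const
  {
    for (auto const& r : live) {
      if (addr >= r.begin && addr < r.end) { return true; }
    }
    return false;
  }
};

// Every request is its own registered allocation, separated by an
// unregistered gap, so a one-past-the-end access is visible to the checker.
struct direct_allocator {
  toy_checker& checker;
  std::size_t next{0};

  std::size_t allocate(std::size_t n)
  {
    std::size_t const addr = next;
    checker.live.push_back({addr, addr + n});
    next += n + 256;
    return addr;
  }
};

// One large registered allocation that is carved up internally. The checker
// never learns where the individual sub-allocations begin and end.
struct pool_allocator {
  toy_checker& checker;
  std::size_t base;
  std::size_t capacity;
  std::size_t offset{0};

  pool_allocator(toy_checker& c, std::size_t cap) : checker{c}, base{4096}, capacity{cap}
  {
    checker.live.push_back({base, base + capacity});
  }

  std::size_t allocate(std::size_t n)
  {
    assert(offset + n <= capacity);
    std::size_t const addr = base + offset;
    offset += n;
    return addr;
  }
};
```

With the direct allocator, an access one byte past a 16-byte buffer lands in an unregistered gap and is flagged. With the pool, the same access lands inside the pool's single large range (here, in the neighboring sub-allocation) and passes unnoticed.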

vyasr · 2025-03-14T19:08:14Z

Good call, I forgot about that. Also looks clean with that, no errors.

shrshi

A couple of small questions, but overall it looks good to me! Thank you for the detailed explanations of the ownership and release logic while shallow copying; it would be nice to preserve some portion of it (https://github.com/rapidsai/cudf/pull/18084/files#r1978708382, https://github.com/rapidsai/cudf/pull/18084/files#r1974287846 for instance) as comments in the code.

cpp/include/cudf/interop.hpp

vuule · 2025-03-14T21:42:06Z

cpp/tests/interop/arrow_data_structures_test.cpp

+    .array       = {},
+    .device_id   = 0,
+    .device_type = ARROW_DEVICE_CUDA,


is it okay to (already) use a C++20 language feature?
Edit: is it only okay in .cpp files?

That is a great question. I wasn't even thinking about this. Designated initializers are a C99 feature, so my guess is that C++ compilers have been supporting this as an extension since well before C++20 added this feature.

I could remove it if you feel strongly, but I've seen multiple other places in our code bases use this already so I don't think it's worth scrubbing. We'll move to C++20 soon enough (rapidsai/build-planning#113).

If the compiler does not complain, neither will I :) I'm mostly surprised this compiles with the current version(s).
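
For readers unfamiliar with the feature under discussion, a minimal sketch of designated initializers follows. The struct here is a simplified, hypothetical stand-in (`toy_device_array`, `toy_array`, `TOY_DEVICE_CUDA`), not the real `ArrowDeviceArray`. C++20 requires the designators to appear in declaration order; compilers such as GCC and Clang also accept this syntax as an extension in earlier language modes.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the Arrow structs (hypothetical, illustration only).
struct toy_array {
  int64_t length;
  void (*release)(toy_array*);
};

struct toy_device_array {
  toy_array array;
  int64_t device_id;
  int32_t device_type;
  void* sync_event;
};

constexpr int32_t TOY_DEVICE_CUDA = 2;

inline toy_device_array make_empty_cuda_array()
{
  // Each member is named explicitly, in the same order as the declaration.
  toy_device_array result{
    .array       = {},
    .device_id   = 0,
    .device_type = TOY_DEVICE_CUDA,
    .sync_event  = nullptr,
  };
  assert(result.array.release == nullptr);  // `{}` value-initializes the nested struct
  return result;
}
```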

vuule · 2025-03-14T22:48:24Z

cpp/include/cudf_test/nanoarrow_utils.hpp

+  {
+    auto private_data = static_cast<VectorOfArrays*>(stream->private_data);
+
+    [[maybe_unused]] auto rc = ArrowSchemaDeepCopy(private_data->schema.get(), out_schema);


I assume I missed a conversation :)
Why don't we check/return the error code here?

Good find. This isn't actually new code, just moved from from_arrow_stream_test.cpp, but now is a good time to clean this up. We should be guaranteed that this won't fail by construction, but no reason not to do the right thing and check here.
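
A minimal sketch of the pattern being requested, using hypothetical stand-ins (`toy_schema`, `toy_stream`, `toy_schema_copy`, `toy_get_schema`) rather than the nanoarrow types. It assumes the C-style convention of returning 0 on success and an errno-compatible code on failure, and propagates the code to the caller instead of discarding it.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical stand-ins for the schema, stream, and private data types.
struct toy_schema {
  const char* format;
};

struct toy_private {
  toy_schema schema;
};

struct toy_stream {
  void* private_data;
};

// Stand-in for a C-style copy routine: returns 0 on success, EINVAL on bad input.
// (This toy version copies the struct by value; it is not a real deep copy.)
inline int toy_schema_copy(toy_schema const* in, toy_schema* out)
{
  if (in == nullptr || out == nullptr || in->format == nullptr) { return EINVAL; }
  *out = *in;
  return 0;
}

// Stream callback that checks the error code and returns it to the caller
// rather than marking it [[maybe_unused]] and dropping it.
inline int toy_get_schema(toy_stream* stream, toy_schema* out_schema)
{
  assert(stream != nullptr);
  auto* private_data = static_cast<toy_private*>(stream->private_data);
  int const rc       = toy_schema_copy(&private_data->schema, out_schema);
  if (rc != 0) { return rc; }
  return 0;
}
```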

vyasr · 2025-03-14T23:24:13Z

> A couple of small questions, but overall it looks good to me! Thank you for the detailed explanations of the ownership and release logic while shallow copying; it would be nice to preserve some portion of it (https://github.com/rapidsai/cudf/pull/18084/files#r1978708382, https://github.com/rapidsai/cudf/pull/18084/files#r1974287846 for instance) as comments in the code.

This is a great idea. I'll distill some of the discussion on this PR into comments.

KyleFromNVIDIA

Approved trivial CMake changes

vyasr · 2025-03-17T21:00:39Z

/merge

The new types introduced in #18084 use the preexisting `to_arrow_device*` functions to produce views. Because not every cudf type maps perfectly to an Arrow type (historically, decimals were misaligned until recent versions of Arrow, and we still store boolean columns as-is rather than as bit columns like Arrow does), there are cases where that conversion requires allocating new memory. Previously there was no way to cache that conversion. Now we can store the intermediate in the new types to avoid reallocating on repeated calls. This change also synchronizes the APIs with those of the corresponding vanilla cudf column/table types. To improve that synchronization, the view is created when these types are constructed, allowing us to drop the stream and mr parameters from the view methods. For readability, I also aligned the ordering of the declarations in the header with the definitions in the source files. I have not benchmarked the impact of these changes since we are not using these APIs anywhere significant yet. I plan to add benchmarks as part of the PRs that leverage the new types from Python, at which point I will optimize as needed.
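
The caching idea can be sketched as follows, using the boolean mismatch mentioned above as the example. This is a host-only illustration under stated assumptions, not the cudf implementation: `cached_bool_column` is a hypothetical class, `std::vector` stands in for device memory, and the bit order (least-significant bit first) follows the Arrow columnar format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical owner of bit-packed booleans that unpacks them to one byte
// per value once, at construction, and then hands out the cached result.
class cached_bool_column {
 public:
  cached_bool_column(std::vector<std::uint8_t> packed_bits, std::size_t size)
    : packed_{std::move(packed_bits)}, unpacked_(size)
  {
    assert(packed_.size() * 8 >= size);
    for (std::size_t i = 0; i < size; ++i) {
      // Arrow stores bit i of the column in bit (i % 8) of byte (i / 8).
      unpacked_[i] = static_cast<std::uint8_t>((packed_[i / 8] >> (i % 8)) & 1u);
    }
    ++conversions_;
  }

  // No allocator or stream-like parameters are needed here because the
  // conversion already happened in the constructor.
  std::vector<std::uint8_t> const& view() const { return unpacked_; }

  int conversions() const { return conversions_; }

 private:
  std::vector<std::uint8_t> packed_;    // original bit-packed data
  std::vector<std::uint8_t> unpacked_;  // cached byte-per-value intermediate
  int conversions_{0};
};
```

Repeated calls to `view()` return the same cached buffer, so the conversion cost is paid once per object rather than once per call.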

This PR leverages #18084 to rework the Python layer of Arrow interchange. With this change, we can now expose [the Arrow capsule interfaces](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for pylibcudf Columns and Tables. This PR also paves the way for exposing the device capsules, which will allow us to provide zero-copy Arrow views into pylibcudf objects.

To get everything working, this PR also makes some ancillary changes:

- These changes uncovered a number of places where the libcudf arrow interop code was not properly handling the NA type or 0 row columns and tables. Those cases have been fixed.
- The code added in #18314 to support constructing pylibcudf Columns from a combination of a libcudf column_view and an arbitrary owner (as opposed to a Column owner) was incomplete. It worked in that PR because we don't actually do anything with Columns produced by the one use case tested there other than store the data then quickly unpack it (this was for packed columns). Using that code path with arrow columns uncovered a much bigger gap. The core issue is that gpumemoryview is constructed assuming that every object that it wraps has a CUDA Array interface. The new factory added in #18314 bypassed that, resulting in a gpumemoryview that was effectively in an invalid state for most operations. To fix this, I replaced the existing approach with a requirement that we wrap the existing owning object in something exposing a CAI before constructing the gpumemoryview.
- To validate that these changes did not regress performance, I ran the Python benchmarks. In the process I added an additional `from_arrow` benchmark.
- While running the benchmarks, I noticed an issue with our usage of pytest-benchmark due to dependency issues. I added a pinning in our repo for now and upstreamed fixes in conda-forge/conda-forge-repodata-patches-feedstock#990 and conda-forge/pytest-benchmark-feedstock#27.
- I also addressed some of the outstanding comments from #18302

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Shruti Shivakumar (https://github.com/shrshi)
- Matthew Murray (https://github.com/Matt711)
- Tianyu Liu (https://github.com/kingcrimsontianyu)

URL: #18402

vyasr added 30 commits February 25, 2025 01:28

First attempt

015ddbd

Get basic conversion working starting with a cudf column

7661570

Enable construction from an ArrowDeviceArray

f01a07b

CMake testing changes

851d4dc

Use nanoarray for copying schema

a09e87d

Fix reference bug

e9865bb

More CMake testing changes

a07b2ae

Enable getting views and try asserting equivalence

bd65da6

Fully passing first round of tests.

a55e508

Add explicit test of lifetime management

943eb2b

Basic arrow table

461c6c8

Add arrow conversions

362529d

Support construction from device array

fda4ca4

Some cleanup

7c07058

Add documentation

11ec3d9

Fix some bugs

2c7e78a

More cleanup

f75b17b

More CMake testing

951914d

Get tests passing with complex nanoarrow host tables

31f134e

More CMake testing

a76c3b9

Support single columns from arrow host data

6a59f41

Also test nanoarrow device data

9600c3c

Implement conversion to host array

68fb9d8

Support direct ingestion of ArrowArray data

36bbd36

Support construction from streams

0bc9389

Make ownership semantics consistent across types

2b2e77f

Make sure stream and mr are forwarded everywhere

89b34c3

Centralize as much logic as possible

eb93262

Dictionary behavior is correct since we are just pointing back to existing private data

374e7ac

Update comments

7014f30

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_column

2e230fe

vyasr added 2 commits March 12, 2025 20:33

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_column

9c7c8ae

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_column

f9091bb

vyasr added 3 commits March 13, 2025 23:10

Update for new header location

e34356e

Fix bug introduced in GH review application

5f36e12

Merge remote-tracking branch 'upstream/branch-25.04' into feat/arrow_column

42c68be

davidwendt approved these changes Mar 13, 2025

Appease linter

bb7ce4b

shrshi approved these changes Mar 14, 2025

vuule reviewed Mar 14, 2025

vyasr added 3 commits March 17, 2025 18:35

Add description of the memory management logic and update some comments.

e401959

Make to_arrow_schema const

83e4b59

Check for error

0d9df33

vyasr requested a review from vuule March 17, 2025 19:16

KyleFromNVIDIA approved these changes Mar 17, 2025

rapids-bot bot merged commit 3bb9881 into rapidsai:branch-25.04 Mar 17, 2025
108 of 109 checks passed

vyasr deleted the feat/arrow_column branch March 17, 2025 21:07

vyasr mentioned this pull request Mar 17, 2025

Cache column view creation from arrow types #18302

Merged

3 tasks

This was referenced Apr 1, 2025

Use owning Arrow types in C++ to expose data to Python #18402

Merged

[FEA] Introduce a new owning type for Arrow interop data #16104

Closed
