[BUG] Potential for use-after-free on libcudf/cudf interface boundaries #14229
Labels: 0 - Backlog, bug, libcudf, Python
Describe the bug
This comes out of a review of #14133, which introduces a new `Scalar` type on the Python side in pylibcudf, but I think that the issues are pervasive.

Background
`libcudf` uses RMM for all memory allocations; this happens through a `memory_resource` object. Allocating `libcudf` functions all take an explicit `memory_resource` argument (as a raw pointer) that is defaulted to `rmm::get_current_device_resource()` (the resource set by `rmm::set_per_device_resource`). RMM `device_buffer`s hold a raw pointer to their allocating memory resource (needed for deallocation), and it is the user's (in this case `libcudf`'s) responsibility to keep that `memory_resource` alive until the `device_buffer` has been dropped.

This is fine:
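A minimal sketch (the size, stream, and choice of `cuda_memory_resource` are just for illustration):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

void fine() {
  rmm::mr::cuda_memory_resource mr;
  // The buffer stores &mr as a raw pointer so that it can deallocate later.
  rmm::device_buffer buf(128, rmm::cuda_stream_default, &mr);
  // ... use buf ...
}  // buf is destroyed before mr, so deallocation goes through a live resource
```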
This is not (this is not a real example, since I think the `device_buffer` constructors don't allow exactly this, but bear with me):
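One way to set up the same hazard (an illustrative sketch; the real concern is at API boundaries rather than an obvious scoping mistake like this):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

rmm::device_buffer not_fine() {
  rmm::mr::cuda_memory_resource mr;  // local to this function
  rmm::device_buffer buf(128, rmm::cuda_stream_default, &mr);
  return buf;  // the buffer escapes the scope of its memory resource...
}              // ...and mr is destroyed here

void caller() {
  auto buf = not_fine();
  // ... use buf ...
}  // use-after-free: buf's destructor deallocates through its (now dangling)
   // pointer to the already-destroyed mr
```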
How does `get_current_device_resource` work?

`set_per_device_resource` stores a raw pointer to a memory resource in a `std::unordered_map`, and `get_current_device_resource` just looks up in the map. Again, the user is responsible for keeping the `mr` alive.
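Conceptually (a simplified sketch of the idea, not RMM's actual implementation, which also locks around the map and installs a default resource):

```cpp
#include <unordered_map>

#include <rmm/mr/device/device_memory_resource.hpp>

// One raw, non-owning pointer per device; the caller must keep it alive.
std::unordered_map<int, rmm::mr::device_memory_resource*> per_device_resource;

rmm::mr::device_memory_resource* set_per_device_resource(
    int device_id, rmm::mr::device_memory_resource* new_mr) {
  auto* old_mr = per_device_resource[device_id];
  per_device_resource[device_id] = new_mr;  // no ownership is taken
  return old_mr;
}

rmm::mr::device_memory_resource* get_per_device_resource(int device_id) {
  return per_device_resource[device_id];  // may dangle if the owner dropped it
}
```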
How does this work in cudf (python) land?
The Cython wrappers in RMM expose memory resources and `set/get_current_device_resource`. `set_per_device_memory_resource` sets the value in both the C++-level `std::unordered_map` and a Python-level dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1041-L1049). `get_current_device_resource` looks up in the Python dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1013-1026). `DeviceBuffer`s keep the "current" memory resource alive by storing a reference (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L93-L93), but when taking ownership of a C++-level device buffer (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L161-L171) we don't use the mr stored in the C++ struct.

The Python-level dict is "necessary" due to the usual expected way in which people will use things from Python. That is, they will expect that:
set_per_device_resource(device, CudaMemoryResource())
keeps the created memory resource alive.
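For instance, this is expected to be safe purely from Python (device 0 and the buffer size are arbitrary choices for illustration):

```python
import rmm

# The CudaMemoryResource below has no other Python reference; users rely on
# RMM's Python-level registry to keep it alive while it is the device's
# resource.
rmm.mr.set_per_device_resource(0, rmm.mr.CudaMemoryResource())

# Allocation still works because the registry keeps the resource alive.
buf = rmm.DeviceBuffer(size=1024)
```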
If a C++ library calls `set_per_device_resource` at some point, C++-land (`libcudf`) and Python-land (cudf) can end up disagreeing about what the current memory resource is (because the Python-level `get_current_device_resource` doesn't look at the C++-level map).
So what's the problem?

If `libcudf` ever allocates memory that it hands off to cudf with a memory resource that is not managed by cudf, we have the potential for use-after-free, because cudf has no way of taking (or sharing) ownership of that memory resource from the RMM buffer.

AFAICT, cudf never explicitly passes a memory resource into `libcudf`, so the allocation behaviour always relies on C++ and Python agreeing about what `get_current_device_resource` returns, and on cudf being the creator of that resource. Effectively the pattern is the one in the sketch below: call a `libcudf` algorithm (which allocates with whatever the C++-level current device resource is at that moment) and then wrap the result in a Python object that keeps the Python-level current resource alive. If either someone in C++ has set a different per-device resource, or the current thread is pre-empted between the allocation and the wrapping (and the current device resource is changed), then the wrapping step will take "ownership" of the data, but will not be keeping the correct memory resource for the data alive.
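A schematic Cython sketch of that pattern (hypothetical code, not the actual cudf wrappers; `cpp_make_zeros` and its header stand in for any allocating libcudf algorithm):

```cython
# distutils: language = c++
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer


# Hypothetical libcudf-like function: allocates its result using whatever
# rmm::mr::get_current_device_resource() returns on the C++ side.
cdef extern from "hypothetical_algorithm.hpp" nogil:
    unique_ptr[device_buffer] cpp_make_zeros(size_t size) except +


def make_zeros(size_t size):
    cdef unique_ptr[device_buffer] c_buf
    with nogil:
        # Allocation: uses the C++-level current device resource.
        c_buf = move(cpp_make_zeros(size))
    # Wrapping: c_from_unique_ptr keeps the *Python-level* current resource
    # alive on the returned DeviceBuffer (not the one recorded inside the C++
    # device_buffer), so nothing guarantees it is the resource that actually
    # performed the allocation above.
    return DeviceBuffer.c_from_unique_ptr(move(c_buf))
```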
How can we fix this?
If the Cython layer in cudf always explicitly passes a memory resource to `libcudf` algorithms, this problem goes away: we can then always guarantee that the memory resource we have on the Python side is the one used to allocate the data in libcudf, so we can keep it alive.

Alternately, if the `memory_resource` pointers in RMM were smart pointers, it might be possible to keep things alive that way. Right now we can't make Python and C++ always agree on what the current default resource is (because on the C++ side RMM doesn't store a smart pointer, since it doesn't take ownership).
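A sketch of what the first option could look like (hypothetical code, not the actual cudf wrappers; it assumes a libcudf-like `cpp_make_zeros` that takes an explicit `device_memory_resource*`, and it uses RMM's Cython-level `DeviceMemoryResource.get_mr()` plus the `mr` reference slot that `DeviceBuffer` already keeps):

```cython
# distutils: language = c++
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer
from rmm._lib.memory_resource cimport DeviceMemoryResource, device_memory_resource
from rmm.mr import get_current_device_resource


# Hypothetical libcudf-like function taking an explicit memory resource.
cdef extern from "hypothetical_algorithm.hpp" nogil:
    unique_ptr[device_buffer] cpp_make_zeros(
        size_t size, device_memory_resource* mr
    ) except +


def make_zeros(size_t size):
    # Resolve the resource once, on the Python side, and use it both for the
    # allocation and as the keep-alive reference.
    cdef DeviceMemoryResource mr = get_current_device_resource()
    cdef device_memory_resource* c_mr = mr.get_mr()
    cdef unique_ptr[device_buffer] c_buf
    with nogil:
        c_buf = move(cpp_make_zeros(size, c_mr))
    cdef DeviceBuffer buf = DeviceBuffer.c_from_unique_ptr(move(c_buf))
    buf.mr = mr  # keep exactly the allocating resource alive
    return buf
```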