[BUG] Potential for use-after-free on libcudf/cudf interface boundaries #14229
Labels: 0 - Backlog, bug, libcudf, Python
Describe the bug
This comes out of a review of #14133, which introduces a new `Scalar` type on the Python side in pylibcudf, but I think that the issues are pervasive.

Background
`libcudf` uses RMM for all memory allocations; this happens through a `memory_resource` object. Allocating `libcudf` functions all take an explicit `memory_resource` argument (as a raw pointer) that is defaulted to `rmm::get_current_device_resource()` (the resource set by `rmm::set_per_device_resource`). RMM `device_buffer`s hold a raw pointer to their allocating memory resource (needed for deallocation), and it is the user's (in this case `libcudf`'s) responsibility to keep that `memory_resource` alive until the `device_buffer` has been dropped.

This is fine:
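A minimal sketch (the size, stream, and choice of `cuda_memory_resource` are just for illustration):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

void fine() {
  rmm::mr::cuda_memory_resource mr;
  // The buffer stores &mr as a raw pointer so that it can deallocate later.
  rmm::device_buffer buf(128, rmm::cuda_stream_default, &mr);
  // ... use buf ...
}  // buf is destroyed before mr, so deallocation goes through a live resource
```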
This is not (this is not a real example, since I think the `device_buffer` constructors don't allow exactly this, but bear with me):
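One way to set up the same hazard (an illustrative sketch; the real concern is at API boundaries rather than an obvious scoping mistake like this):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>

rmm::device_buffer not_fine() {
  rmm::mr::cuda_memory_resource mr;  // local to this function
  rmm::device_buffer buf(128, rmm::cuda_stream_default, &mr);
  return buf;  // the buffer escapes the scope of its memory resource...
}              // ...and mr is destroyed here

void caller() {
  auto buf = not_fine();
  // ... use buf ...
}  // use-after-free: buf's destructor deallocates through its (now dangling)
   // pointer to the already-destroyed mr
```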
How does `get_current_device_resource` work?

`set_per_device_resource` stores a raw pointer to a memory resource in a `std::unordered_map`, and `get_current_device_resource` just looks up in the map. Again, the user is responsible for keeping the `mr` alive.
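Conceptually (a simplified sketch of the idea, not RMM's actual implementation, which also locks around the map and installs a default resource):

```cpp
#include <unordered_map>

#include <rmm/mr/device/device_memory_resource.hpp>

// One raw, non-owning pointer per device; the caller must keep it alive.
std::unordered_map<int, rmm::mr::device_memory_resource*> per_device_resource;

rmm::mr::device_memory_resource* set_per_device_resource(
    int device_id, rmm::mr::device_memory_resource* new_mr) {
  auto* old_mr = per_device_resource[device_id];
  per_device_resource[device_id] = new_mr;  // no ownership is taken
  return old_mr;
}

rmm::mr::device_memory_resource* get_per_device_resource(int device_id) {
  return per_device_resource[device_id];  // may dangle if the owner dropped it
}
```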
How does this work in cudf (python) land?
The Cython wrappers in RMM expose memory resources and `set/get_current_device_resource`. `set_per_device_memory_resource` sets the value in both the C++-level `std::unordered_map` and a Python-level dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1041-L1049). `get_current_device_resource` looks up in the Python dict (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/memory_resource.pyx#L1013-1026). `DeviceBuffer`s keep the "current" memory resource alive by storing a reference (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L93-L93), but when taking ownership of a C++-level device buffer (https://github.com/rapidsai/rmm/blob/5f07014db51535902f8cdb515596af162d1ae6ca/python/rmm/_lib/device_buffer.pyx#L161-L171) we don't use the mr stored in the C++ struct.

The Python-level dict is "necessary" due to the usual expected way in which people will use things from Python. That is, they will expect that:
set_per_device_resource(device, CudaMemoryResource())
keeps the created memory resource alive.
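For instance, this is expected to be safe purely from Python (device 0 and the buffer size are arbitrary choices for illustration):

```python
import rmm

# The CudaMemoryResource below has no other Python reference; users rely on
# RMM's Python-level registry to keep it alive while it is the device's
# resource.
rmm.mr.set_per_device_resource(0, rmm.mr.CudaMemoryResource())

# Allocation still works because the registry keeps the resource alive.
buf = rmm.DeviceBuffer(size=1024)
```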
If a C++ library calls `set_per_device_resource` at some point, C++-land (`libcudf`) and Python-land (cudf) can end up disagreeing about what the current memory resource is (because the Python-level `get_current_device_resource` doesn't look at the C++-level map).
So what's the problem?

If `libcudf` ever allocates memory that it hands off to cudf with a memory resource that is not managed by cudf, we have the potential for use-after-free, because cudf has no way of taking (or sharing) ownership of that memory resource from the RMM buffer.

AFAICT, cudf never explicitly passes a memory resource into `libcudf`, so the allocation behaviour always relies on C++ and Python agreeing about what `get_current_device_resource` returns, and on cudf being the creator of that resource. Effectively the pattern is the one in the sketch below: call a `libcudf` algorithm (which allocates with whatever the C++-level current device resource is at that moment) and then wrap the result in a Python object that keeps the Python-level current resource alive. If either someone in C++ has set a different per-device resource, or the current thread is pre-empted between the allocation and the wrapping (and the current device resource is changed), then the wrapping step will take "ownership" of the data, but will not be keeping the correct memory resource for the data alive.
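A schematic Cython sketch of that pattern (hypothetical code, not the actual cudf wrappers; `cpp_make_zeros` and its header stand in for any allocating libcudf algorithm):

```cython
# distutils: language = c++
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer


# Hypothetical libcudf-like function: allocates its result using whatever
# rmm::mr::get_current_device_resource() returns on the C++ side.
cdef extern from "hypothetical_algorithm.hpp" nogil:
    unique_ptr[device_buffer] cpp_make_zeros(size_t size) except +


def make_zeros(size_t size):
    cdef unique_ptr[device_buffer] c_buf
    with nogil:
        # Allocation: uses the C++-level current device resource.
        c_buf = move(cpp_make_zeros(size))
    # Wrapping: c_from_unique_ptr keeps the *Python-level* current resource
    # alive on the returned DeviceBuffer (not the one recorded inside the C++
    # device_buffer), so nothing guarantees it is the resource that actually
    # performed the allocation above.
    return DeviceBuffer.c_from_unique_ptr(move(c_buf))
```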
How can we fix this?
If the Cython layer in cudf always explicitly passes a memory resource to `libcudf` algorithms, this problem goes away: we can then always guarantee that the memory resource we have on the Python side is the one used to allocate the data in libcudf, so we can keep it alive.

Alternately, if the `memory_resource` pointers in RMM were smart pointers, it might be possible to keep things alive that way. Right now we can't make Python and C++ always agree on what the current default resource is (because on the C++ side RMM doesn't store a smart pointer, since it doesn't take ownership).
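A sketch of what the first option could look like (hypothetical code, not the actual cudf wrappers; it assumes a libcudf-like `cpp_make_zeros` that takes an explicit `device_memory_resource*`, and it uses RMM's Cython-level `DeviceMemoryResource.get_mr()` plus the `mr` reference slot that `DeviceBuffer` already keeps):

```cython
# distutils: language = c++
from libcpp.memory cimport unique_ptr
from libcpp.utility cimport move

from rmm._lib.device_buffer cimport DeviceBuffer, device_buffer
from rmm._lib.memory_resource cimport DeviceMemoryResource, device_memory_resource
from rmm.mr import get_current_device_resource


# Hypothetical libcudf-like function taking an explicit memory resource.
cdef extern from "hypothetical_algorithm.hpp" nogil:
    unique_ptr[device_buffer] cpp_make_zeros(
        size_t size, device_memory_resource* mr
    ) except +


def make_zeros(size_t size):
    # Resolve the resource once, on the Python side, and use it both for the
    # allocation and as the keep-alive reference.
    cdef DeviceMemoryResource mr = get_current_device_resource()
    cdef device_memory_resource* c_mr = mr.get_mr()
    cdef unique_ptr[device_buffer] c_buf
    with nogil:
        c_buf = move(cpp_make_zeros(size, c_mr))
    cdef DeviceBuffer buf = DeviceBuffer.c_from_unique_ptr(move(c_buf))
    buf.mr = mr  # keep exactly the allocating resource alive
    return buf
```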