[FEA] Implement a PerDeviceResourceManager #2043

@vyasr

Description

Is your feature request related to a problem? Please describe.
rmm currently exposes a default mapping of device ids to default memory resources that can be accessed using free functions like get_current_device_resource. These functions have been useful for a long time, and in both C++ and Python rmm's default mr mapping has become the de facto centralized source of memory resources across all of RAPIDS. While there are a number of ergonomic benefits to this functionality, the most important benefit historically has been that all RAPIDS libraries can easily share the same application-managed pool of CUDA memory, since having multiple pools would lead to undesirable memory contention.
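
For reference, the status quo looks roughly like the sketch below (headers, adaptor choice, and pool size are illustrative; recent rmm versions also offer `*_resource_ref` variants of these free functions):

```cpp
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

int main()
{
  // One application-wide pool, installed as the default for the current device.
  rmm::mr::cuda_memory_resource upstream;
  rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool{&upstream, 1u << 30};
  rmm::mr::set_current_device_resource(&pool);

  // Every RAPIDS library that calls get_current_device_resource() now shares `pool`.
  auto* mr = rmm::mr::get_current_device_resource();
  void* p  = mr->allocate(1024);
  mr->deallocate(p, 1024);
}
```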

With advances in newer CUDA versions, and especially CUDA 13, this is no longer as much of an issue. As of CUDA 11.2, CUDA's stream-ordered allocators support pools. With CUDA 13, it is now possible to allocate asynchronously (i.e. in a stream-ordered fashion) from an arbitrary pool using cuMemAllocFromPoolAsync. With this functionality, starting in CUDA 13 we can effectively use a single entry point for allocating all types of CUDA memory in all cases. If we need legacy synchronous allocation we can simply pass in the default stream, and if we need specific types of memory like managed memory we can simply use the appropriate pool. The most important property of these new APIs is that the memory pools are handled automatically by the CUDA driver, so they no longer have to be explicitly coordinated between applications. If an application requires a very specific pool it can create one, but in almost all cases the default pools provided by the driver, accessible via the cuMemGetDefaultMemPool API, should suffice.
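
A minimal sketch of this pattern using the CUDA runtime equivalents of the driver calls named above (cudaDeviceGetDefaultMemPool and cudaMallocFromPoolAsync, both available since CUDA 11.2); error checking is omitted for brevity:

```cpp
#include <cuda_runtime_api.h>

int main()
{
  int device = 0;
  cudaSetDevice(device);

  // Ask the driver for this device's default memory pool; no explicit pool
  // creation or cross-library coordination is required.
  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, device);

  cudaStream_t stream{};
  cudaStreamCreate(&stream);

  // Stream-ordered allocation from the driver-managed pool.
  void* ptr = nullptr;
  cudaMallocFromPoolAsync(&ptr, 1 << 20, pool, stream);
  // ... launch kernels using `ptr` on `stream` ...
  cudaFreeAsync(ptr, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}
```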

Meanwhile, the problems with using rmm's default mrs in every application have become more acute over time. The desire to use different types of memory in different applications (for example, cudf might wish to use managed memory by default) results in the usual problems with any sort of global state: one library's modifications implicitly affect everyone else. Currently the only way for a library to avoid using rmm's default mrs is to manually create its own such set internally and then pass those mrs into every single API call. The latter problem (threading mrs through every call) must be solved by each library implementing suitable internal interfaces around its preferred set of mrs, but the former (maintaining an internal set of per-device mrs) can be facilitated by rmm.

Describe the solution you'd like
I would like rmm's current functionality for a per-device resource map to be wrapped into a class that can be reused by other libraries to provide their own per-device resources; a rough sketch of such a class is included below. rmm can itself maintain a global singleton instance of this class as a backwards-compatible shim behind the legacy APIs for its own per-device resources, while other libraries leverage the new class to more easily avoid rmm's global store. By encouraging increased usage of rmm mrs backed by modern CUDA memory pools, we can get better pooling behavior than rmm's previously existing pool, since pooling happens at the level of the CUDA driver. Meanwhile, this new approach allows different applications to make different choices about how they want memory allocated without potentially impacting other libraries. Examples beyond the managed memory case mentioned above include more precise per-library statistics and logging, or differently sized binning adaptors tuned to the allocation patterns of specific libraries.
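
A rough sketch of what such a class could look like; the name, fallback policy, and use of raw device_memory_resource pointers are all illustrative rather than a proposed final API:

```cpp
#include <rmm/cuda_device.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <map>
#include <mutex>

// Hypothetical reusable per-device resource map. rmm could keep a global
// singleton instance of this class to back its legacy free functions, while
// other libraries hold their own instances instead of mutating global state.
class per_device_resource_manager {
 public:
  // Return the resource registered for `id`, falling back to rmm's global
  // default (an illustrative policy) if this manager has none registered.
  rmm::mr::device_memory_resource* get(rmm::cuda_device_id id) const
  {
    std::lock_guard<std::mutex> lock{mtx_};
    auto it = resources_.find(id.value());
    return it != resources_.end() ? it->second : rmm::mr::get_per_device_resource(id);
  }

  // Register `mr` for device `id`, returning the previously registered resource.
  rmm::mr::device_memory_resource* set(rmm::cuda_device_id id,
                                       rmm::mr::device_memory_resource* mr)
  {
    std::lock_guard<std::mutex> lock{mtx_};
    auto it   = resources_.find(id.value());
    auto* old = it != resources_.end() ? it->second : nullptr;
    resources_[id.value()] = mr;
    return old;
  }

 private:
  mutable std::mutex mtx_;
  std::map<int, rmm::mr::device_memory_resource*> resources_;
};
```

A library such as cudf could then hold one of these instances internally and consult it instead of rmm::mr::get_current_device_resource(), leaving rmm's global map untouched.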
