Is your feature request related to a problem? Please describe.
During multi-GPU, single-node workflows, it is common for one GPU to hit OOM while others on the system are partially free or even completely idle.
As of 25.10, the best option is for the oversubscribed GPU to spill to host, either in the application layer or by using a CUDA managed memory resource.
In most cases, we expect it would be more efficient to make those allocations on a peer GPU on the same node and access the data over NVLink. For example, GPU0 would launch kernels that access data resident on GPU1, as sketched below.
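As a concrete illustration of the access pattern (not cuDF code), here is a minimal CUDA C++ sketch, assuming peer access is supported between GPU0 and GPU1: memory is allocated on GPU1 with plain `cudaMalloc`, GPU0 enables peer access, and a kernel launched on GPU0 reads and writes that peer memory. Error checking is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdio>

#include <cuda_runtime.h>

// Trivial kernel: doubles each element in place.
__global__ void scale(double* data, std::size_t n)
{
  std::size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] *= 2.0; }
}

int main()
{
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
  if (!can_access) {
    std::printf("GPU0 cannot access GPU1 memory directly\n");
    return 1;
  }

  // Allocate on GPU1.
  cudaSetDevice(1);
  std::size_t const n = 1 << 20;
  double* peer_data{};
  cudaMalloc(reinterpret_cast<void**>(&peer_data), n * sizeof(double));
  cudaMemset(peer_data, 0, n * sizeof(double));

  // Launch on GPU0, reading and writing GPU1's memory over NVLink/PCIe.
  cudaSetDevice(0);
  cudaDeviceEnablePeerAccess(/*peerDevice=*/1, 0);
  auto const blocks = static_cast<unsigned int>((n + 255) / 256);
  scale<<<blocks, 256>>>(peer_data, n);
  cudaDeviceSynchronize();

  cudaSetDevice(1);
  cudaFree(peer_data);
  return 0;
}
```

If this pattern proves correct and fast for the kernels cuDF cares about, the remaining work is mostly about how allocations end up on the peer device, which is what the items below scope out.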
Describe the solution you'd like
This is a large topic, and the main need right now is scoping. The table below summarizes the open workstreams.
Topic | Summary | Status |
---|---|---|
Confirm correctness for launching kernels on peer data | We haven't tested launching cuDF kernels on GPU0 that read or write data resident on GPU1. We may need to provide more information at kernel launch, which could require significant changes in cuDF or other RMM users. | |
Confirm performance for launching kernels on peer data | We haven't measured the performance impact of launching kernels on peer data. The data may move efficiently without additional work, but it seems likely that we will need cudaMemAdvise or other hints to achieve good performance when peer data is accessed. | |
Compose an MR that could be used with peer data | Most multi-GPU applications of cuDF use a process-per-GPU model. The MR for each process could be composed of an MR for the primary GPU and child MRs for peer GPUs. The composed MR would live in applications that use RMM and would not necessarily change RMM code directly (see the sketch after this table). | |
Scope and add any necessary resource adaptors to RMM | If new resource adaptors turn out to be necessary or preferable, they should be added to RMM. We may need a "fallback" resource adaptor (#2074) to trigger moving from the primary GPU to a peer GPU, and a "round robin" resource adaptor to select the peer (also covered by the sketch after this table). | |
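To make the "compose an MR" and "fallback" rows above concrete, here is a minimal application-side sketch against RMM's `device_memory_resource` interface. The `peer_fallback_resource` name and behavior are hypothetical, not an existing RMM API: it allocates from the primary GPU's resource and retries on a peer GPU's resource when the primary throws `rmm::bad_alloc`, tracking which pointers came from the peer so they are returned to the right upstream.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/error.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <mutex>
#include <unordered_set>

// RAII helper used by the sketch: make `device` current for the enclosing scope.
struct scoped_device {
  explicit scoped_device(int device) { cudaGetDevice(&old_); cudaSetDevice(device); }
  ~scoped_device() { cudaSetDevice(old_); }
  int old_{};
};

// Hypothetical adaptor: allocate from the primary GPU's resource, and fall back
// to a resource whose memory lives on a peer GPU when the primary is exhausted.
class peer_fallback_resource final : public rmm::mr::device_memory_resource {
 public:
  peer_fallback_resource(rmm::mr::device_memory_resource* primary,
                         rmm::mr::device_memory_resource* peer,
                         int peer_device)
    : primary_{primary}, peer_{peer}, peer_device_{peer_device}
  {
  }

 private:
  void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    try {
      return primary_->allocate(bytes, stream);
    } catch (rmm::bad_alloc const&) {
      // Allocate with the peer device current so any underlying cudaMalloc
      // (e.g. pool growth) lands in the peer GPU's memory.
      scoped_device guard{peer_device_};
      void* ptr = peer_->allocate(bytes, stream);
      std::lock_guard<std::mutex> lock{mtx_};
      peer_ptrs_.insert(ptr);
      return ptr;
    }
  }

  void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override
  {
    bool is_peer = false;
    {
      std::lock_guard<std::mutex> lock{mtx_};
      is_peer = peer_ptrs_.erase(ptr) > 0;
    }
    if (is_peer) {
      scoped_device guard{peer_device_};
      peer_->deallocate(ptr, bytes, stream);
    } else {
      primary_->deallocate(ptr, bytes, stream);
    }
  }

  rmm::mr::device_memory_resource* primary_;
  rmm::mr::device_memory_resource* peer_;
  int peer_device_;
  std::unordered_set<void*> peer_ptrs_;
  std::mutex mtx_;
};
```

An application would construct a `pool_memory_resource` per device, wrap them in an adaptor like this, and install the result with `rmm::mr::set_current_device_resource()`. A "round robin" variant would rotate `peer_device_` across several peers instead of falling back to a single one.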
Describe alternatives you've considered
Use spilling to host, either implicitly through CUDA managed memory or explicitly in the application layer.
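For comparison, here is a minimal sketch of the managed-memory variant of this alternative, assuming a process-per-GPU setup: a pool built on `rmm::mr::managed_memory_resource` is installed as the current device resource so that oversubscription pages to host instead of failing. The 1 GiB initial pool size is an arbitrary placeholder.

```cpp
#include <rmm/mr/device/managed_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <cstddef>

int main()
{
  // Managed (unified) memory may oversubscribe the GPU; pages migrate to host
  // under memory pressure instead of the allocation failing.
  rmm::mr::managed_memory_resource managed_mr;
  rmm::mr::pool_memory_resource<rmm::mr::managed_memory_resource> pool_mr{
    &managed_mr, std::size_t{1} << 30 /* placeholder initial pool size */};
  rmm::mr::set_current_device_resource(&pool_mr);

  // ... run the workload; allocations beyond device capacity spill to host ...
  return 0;
}
```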
Additional context
Clearly, peer-to-peer memory throughput will be lower than primary GPU memory throughput. However, many important kernels in data processing show low DRAM utilization (decompression, decoding, atomics-bound cuco count and retrieve). If peer access yields kernel runtimes similar to primary access, we might see benefits from treating the sum of a node's GPU memory as a single pool.