Description
When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of all processes running on a particular GPU (GPU 0 in this case), I saw erroneous results being reported. Compared with nvidia-smi in the terminal, the usedGpuMemory field contained the process ID, while the pid field, rather than containing the process ID, contained the used GPU memory; in other words, the two values were swapped. Sometimes other fields of the process object carried the process ID or GPU memory values instead, so the field values of the objects returned by nvmlDeviceGetComputeRunningProcesses appeared shuffled overall. This warrants investigation to ensure nvmlDeviceGetComputeRunningProcesses consistently returns correct output.
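For reference, here is a minimal sketch of how I read those fields (assuming the standard pynvml return objects, which expose pid and usedGpuMemory attributes; the explicit nvmlShutdown is added here only for completeness):

import pynvml.nvml as nvml

nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)
for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # Expected: proc.pid is the OS process ID and proc.usedGpuMemory is the
    # memory used by that process; in the bug these values appear swapped
    # (and occasionally shuffled across other fields).
    print(proc.pid, proc.usedGpuMemory)
nvml.nvmlShutdown()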
Code for reproducing the bug
import time

import pynvml.nvml as nvml
import multiprocess as mp
import torch


def main():
    event = mp.Event()
    # Poll NVML in a background process while the pool runs CUDA work.
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    event.set()
    profiling_process.join()
    profiling_process.close()


def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break


def _multiprocess_task(num: int):
    # Allocate two tensors of increasing size on GPU 0 so each worker shows
    # up as a compute process with nonzero memory usage.
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape


if __name__ == '__main__':
    main()
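As a rough way to flag the shuffling programmatically (this is not part of the original repro, and it assumes psutil as an extra dependency), one could cross-check each reported pid against the set of live process IDs:

import psutil
import pynvml.nvml as nvml

nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)
live_pids = set(psutil.pids())
for proc in nvml.nvmlDeviceGetComputeRunningProcesses(handle):
    if proc.pid not in live_pids:
        # A pid that does not match any running process suggests the pid
        # field is actually carrying some other value (e.g. the memory figure).
        print(f"Suspicious entry: pid={proc.pid}, usedGpuMemory={proc.usedGpuMemory}")
nvml.nvmlShutdown()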
Environment
torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03