BUG: Output of nvml.nvmlDeviceGetComputeRunningProcesses reports pid as usedGpuMemory and usedGpuMemory as pid #50

Open
@erikhuck

Description

When using nvmlDeviceGetComputeRunningProcesses to get the usedGpuMemory of every process using a particular GPU (GPU 0 in this case), I saw erroneous results. Compared with nvidia-smi in the terminal, the usedGpuMemory field contained the process ID, while the pid field contained the used GPU memory instead of the process ID, so the two values were swapped. Sometimes other fields of the process object contained the process ID or the GPU memory value, so the field values of the process objects output by nvmlDeviceGetComputeRunningProcesses were shuffled overall. This needs investigation to ensure nvmlDeviceGetComputeRunningProcesses consistently returns correct output.
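The comparison with nvidia-smi described above could be automated roughly as follows. This is only a sketch: the helper names (parse_smi_apps, find_swapped, query_smi_apps) are hypothetical, it assumes nvidia-smi reports used_memory in MiB when nounits is given, and the swap heuristic (the usedGpuMemory value is a PID that nvidia-smi knows while the pid value is not) is my own assumption, not something NVML provides.

```python
import subprocess


def parse_smi_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows (memory in MiB) into {pid: used_bytes}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        fields = [part.strip() for part in line.split(",")]
        try:
            pid, used_mib = int(fields[0]), int(fields[1])
        except (ValueError, IndexError):
            # Skip rows that are malformed or report memory as not available.
            continue
        usage[pid] = used_mib * 1024 * 1024
    return usage


def find_swapped(nvml_records, smi_usage):
    """Return (pid, usedGpuMemory) pairs that look swapped relative to nvidia-smi.

    Heuristic (an assumption): the reported pid is unknown to nvidia-smi,
    but the reported usedGpuMemory value is a PID that nvidia-smi lists.
    """
    return [
        (pid, used)
        for pid, used in nvml_records
        if pid not in smi_usage and used in smi_usage
    ]


def query_smi_apps():
    """Ask nvidia-smi for the compute processes (requires an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_smi_apps(result.stdout)


# Made-up values for illustration only; the second record has its fields swapped.
sample_usage = parse_smi_apps("1234, 500\n5678, 1200\n")
sample_records = [(1234, 500 * 1024 * 1024), (1200 * 1024 * 1024, 5678)]
print(find_swapped(sample_records, sample_usage))
```

In the reproduction script, nvml_records would be built as [(p.pid, p.usedGpuMemory) for p in gpu_processes] and smi_usage would come from query_smi_apps(). Since the two tools are sampled at slightly different times, short-lived processes may produce mismatches that are not swaps.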

Code for reproducing the bug

import time

import multiprocess as mp
import pynvml.nvml as nvml
import torch

def main():
    # Poll NVML in a separate process while the pool workers use the GPU.
    event = mp.Event()
    profiling_process = mp.Process(target=_profile_resources, kwargs={'event': event})
    profiling_process.start()
    with mp.Pool(8) as pool:
        for res in [pool.apply_async(_multiprocess_task, (i,)) for i in range(12)]:
            res.get()
    # Signal the profiler to stop once all tasks have finished.
    event.set()
    profiling_process.join()
    profiling_process.close()

def _profile_resources(event):
    nvml.nvmlInit()
    while True:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        # The printed process objects are where pid and usedGpuMemory appear swapped.
        gpu_processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
        print(gpu_processes)
        time.sleep(.1)
        if event.is_set():
            break

def _multiprocess_task(num: int):
    # Allocate two tensors of 5**num elements on GPU 0 and hold them for a second.
    t1 = torch.tensor([1.1] * int(5**num)).to(torch.device('cuda:0'))
    t2 = torch.tensor([2.2] * int(5**num)).to(torch.device('cuda:0'))
    time.sleep(1)
    return (t1 * t2).shape

if __name__ == '__main__':
    main()
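Printing the raw process objects makes it hard to see which value sits in which field. A small helper like the one below (describe_process is a hypothetical name, not part of pynvml) prints each field by name. It assumes only that each record exposes pid and usedGpuMemory attributes, and that usedGpuMemory may be None when the value is not available.

```python
from types import SimpleNamespace


def describe_process(proc):
    """Format one process record by explicit field name."""
    used = getattr(proc, "usedGpuMemory", None)
    used_text = "n/a" if used is None else f"{used / 2**20:.1f} MiB"
    return f"pid={proc.pid} usedGpuMemory={used_text}"


# Stand-in record with made-up values; in the script above this would be
# each element of gpu_processes.
example = SimpleNamespace(pid=1234, usedGpuMemory=512 * 2**20)
print(describe_process(example))  # pid=1234 usedGpuMemory=512.0 MiB
```

In _profile_resources, the print call could then become a loop such as: for p in gpu_processes: print(describe_process(p)). A pid that reads like a byte count, or a memory value of a few thousand bytes, would point to the swap directly.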

Environment

torch==2.0.1
pynvml==11.5.0
CUDA version: 12.2
GPU Model: NVIDIA GeForce RTX 4080
Driver Version: 535.54.03
