DivideByZero error causing plots not to update #234

@jacobtomlinson

On systems with shared memory, the total device memory returned by pynvml is 0. This raises a ZeroDivisionError wherever that value is used as a divisor.

[E 2025-05-07 14:43:37.537 ServerApp] Exception in callback <bound method GPUResourceWebSocketHandler.send_data of <jupyterlab_nvdashboard.apps.gpu.GPUResourceWebSocketHandler object at 0xf96f511b0170>>
    Traceback (most recent call last):
      File "/home/jtomlinson/miniforge3/envs/rapids-25.04/lib/python3.12/site-packages/tornado/ioloop.py", line 937, in _run
        val = self.callback()
              ^^^^^^^^^^^^^^^
      File "/home/jtomlinson/miniforge3/envs/rapids-25.04/lib/python3.12/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 113, in send_data
        (stats["gpu_memory_total"] / gpu_mem_sum) * 100, 2
         ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
    ZeroDivisionError: float division by zero

It would be good to handle this more gracefully. Some graphs just fail to update, while others, such as the memory usage graph, show 18 EB of memory.
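One way to avoid the exception is to guard the percentage calculation. The sketch below is only an illustration: `safe_percent` is a hypothetical helper, not something that exists in jupyterlab_nvdashboard, and returning 0.0 for a zero total is an assumption about what the dashboard should display.

```python
def safe_percent(part: float, total: float, ndigits: int = 2) -> float:
    """Return part / total as a percentage, rounded to ndigits.

    Hypothetical helper: falls back to 0.0 when total is zero or negative,
    which is the case when NVML reports 0 bytes of device memory.
    """
    if total <= 0:
        return 0.0
    return round((part / total) * 100, ndigits)
```

With this in place, the failing expression in `send_data` could be written as `safe_percent(stats["gpu_memory_total"], gpu_mem_sum)`, so the callback would keep sending data instead of raising.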

In this case we probably need to query the host memory via psutil and display that data instead.
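A rough sketch of that fallback, assuming psutil is installed: `effective_total_memory` and `_host_total_bytes` are hypothetical names for illustration, and treating the host total as the GPU total is an assumption that only makes sense on shared-memory systems.

```python
from typing import Callable


def _host_total_bytes() -> int:
    """Total host memory in bytes, as reported by psutil."""
    # psutil is a third-party package; it is imported lazily so the rest
    # of this sketch can be used (and tested) without it installed.
    import psutil

    return psutil.virtual_memory().total


def effective_total_memory(
    nvml_total: int,
    host_total: Callable[[], int] = _host_total_bytes,
) -> int:
    """Return the NVML total, or the host total when NVML reports 0.

    Hypothetical helper: on shared-memory systems NVML reports a total of
    0 bytes, so the host memory size is used as a stand-in.
    """
    if nvml_total > 0:
        return nvml_total
    return host_total()
```

The `host_total` parameter exists only so the fallback can be swapped out in tests; in the dashboard the default would be used, for example `effective_total_memory(pynvml.nvmlDeviceGetMemoryInfo(handle).total)`.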

Unfortunately, this is only reproducible on machines where NVML reports the total GPU memory as 0. If you have such a system, you can run the following script to confirm.

# memory_mre.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "nvidia-ml-py",
# ]
# ///

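import pynvml

# Initialise NVML before making any other NVML call.
pynvml.nvmlInit()

try:
    print("Detecting GPU memory")
    for gpu_idx in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
        # Affected systems report a total of 0 bytes here.
        total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
        print(f"GPU {gpu_idx}: {total} bytes")
finally:
    # Release NVML resources even if a query fails.
    pynvml.nvmlShutdown()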
