On systems with shared memory, the total device memory returned by pynvml is 0. This results in ZeroDivisionError exceptions in places where this value is used.
[E 2025-05-07 14:43:37.537 ServerApp] Exception in callback <bound method GPUResourceWebSocketHandler.send_data of <jupyterlab_nvdashboard.apps.gpu.GPUResourceWebSocketHandler object at 0xf96f511b0170>>
Traceback (most recent call last):
File "/home/jtomlinson/miniforge3/envs/rapids-25.04/lib/python3.12/site-packages/tornado/ioloop.py", line 937, in _run
val = self.callback()
^^^^^^^^^^^^^^^
File "/home/jtomlinson/miniforge3/envs/rapids-25.04/lib/python3.12/site-packages/jupyterlab_nvdashboard/apps/gpu.py", line 113, in send_data
(stats["gpu_memory_total"] / gpu_mem_sum) * 100, 2
~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~
ZeroDivisionError: float division by zero
(stats["gpu_memory_total"] / gpu_mem_sum) * 100, 2 |
It would be good to handle this more gracefully. Some graphs just fail to update, while others, like the memory usage graph, show 18 EB of memory.
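As a minimal sketch of the graceful handling (not the actual fix, and the safe_percentage helper below is hypothetical), the percentage calculation in send_data could guard against a zero total before dividing:

def safe_percentage(used: float, total: float) -> float:
    """Return used as a percentage of total, or 0.0 when total is 0."""
    if total <= 0:
        return 0.0
    return round((used / total) * 100, 2)

# In send_data, instead of:
#     round((stats["gpu_memory_total"] / gpu_mem_sum) * 100, 2)
# the handler could call:
#     safe_percentage(stats["gpu_memory_total"], gpu_mem_sum)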

In this case we probably need to query the host memory via psutil and display that data instead.
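A rough sketch of that fallback, assuming psutil is available; device_memory_total is an illustrative helper, not part of the dashboard's existing code:

import psutil
import pynvml

def device_memory_total(handle) -> int:
    """Total memory for an NVML device, falling back to host RAM when NVML reports 0."""
    total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    if total == 0:
        # Shared-memory system: NVML reports 0, so approximate with host memory.
        total = psutil.virtual_memory().total
    return total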
Unfortunately, this is only reproducible on machines where GPU memory is reported by NVML as 0. But if you have such a system you can run the following script.
# memory_mre.py
# /// script
# requires-python = ">=3.12"
# dependencies = [
#     "nvidia-ml-py",
# ]
# ///
import pynvml

pynvml.nvmlInit()
print("Detecting GPU memory")
for gpu_idx in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
    print(f"GPU {gpu_idx}: {pynvml.nvmlDeviceGetMemoryInfo(handle).total} bytes")