Description
PyNVML bindings are great for handling all GPU information management from Python, but they are almost entirely an identical copy of the C API. This can be a barrier for Python users, who must work out from the NVML API documentation what the API provides and which types need to be passed. We currently use PyNVML in both Distributed and Dask-CUDA, and the overlap between the two leads to code duplication.
I feel one way to reduce code duplication and make things easier for new users is to provide a "High-level PyNVML library" that takes care of users' basic needs. For example, I would imagine something like the following (but not limited to it) being available (implementation omitted for simplicity):
from typing import Optional, Union


class Handle:
    """A handle to a GPU device.

    Parameters
    ----------
    index: int, optional
        Integer representing the CUDA device index to get a handle to.
    uuid: bytes or str, optional
        UUID of a CUDA device to get a handle to.

    Raises
    ------
    ValueError
        If neither `index` nor `uuid` is specified or if both are specified.
    """

    def __init__(
        self, index: Optional[int] = None, uuid: Optional[Union[bytes, str]] = None
    ) -> None:
        ...  # implementation omitted

    @property
    def free_memory(self) -> int:
        """
        Free memory of the CUDA device.
        """

    @property
    def total_memory(self) -> int:
        """
        Total memory of the CUDA device.
        """

    @property
    def used_memory(self) -> int:
        """
        Used memory of the CUDA device.
        """
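To make the proposal more concrete, here is a minimal sketch of how the interface above might be filled in on top of PyNVML. It is only an illustration, not a settled design: the `nvml` keyword argument and the `_load_nvml` helper are hypothetical additions that defer the `pynvml` import and allow a stand-in module to be passed for testing, and reporting values in bytes simply follows what `nvmlDeviceGetMemoryInfo` returns.

```python
from typing import Any, Optional, Union


def _load_nvml() -> Any:
    """Import pynvml lazily and initialize NVML (needs an NVIDIA driver)."""
    import pynvml

    pynvml.nvmlInit()
    return pynvml


class Handle:
    """Sketch of a GPU handle backed by PyNVML."""

    def __init__(
        self,
        index: Optional[int] = None,
        uuid: Optional[Union[bytes, str]] = None,
        nvml: Any = None,  # hypothetical hook: pass a stand-in module for testing
    ) -> None:
        # Exactly one of `index` and `uuid` must be given.
        if (index is None) == (uuid is None):
            raise ValueError("Specify exactly one of `index` or `uuid`.")
        self._nvml = nvml if nvml is not None else _load_nvml()
        if index is not None:
            self._handle = self._nvml.nvmlDeviceGetHandleByIndex(index)
        else:
            # Assumption: bytes UUIDs are accepted across PyNVML releases,
            # so a str UUID is encoded before the lookup.
            if isinstance(uuid, str):
                uuid = uuid.encode()
            self._handle = self._nvml.nvmlDeviceGetHandleByUUID(uuid)

    def _memory_info(self) -> Any:
        # Queried on every access so the values reflect the current state.
        return self._nvml.nvmlDeviceGetMemoryInfo(self._handle)

    @property
    def free_memory(self) -> int:
        """Free memory of the CUDA device, in bytes."""
        return self._memory_info().free

    @property
    def total_memory(self) -> int:
        """Total memory of the CUDA device, in bytes."""
        return self._memory_info().total

    @property
    def used_memory(self) -> int:
        """Used memory of the CUDA device, in bytes."""
        return self._memory_info().used
```

On a machine with an NVIDIA GPU and `pynvml` installed, `Handle(index=0).free_memory` would then replace the usual init / get-handle / get-memory-info sequence that both Distributed and Dask-CUDA currently spell out by hand.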
More than the above would need to be covered, such as getting the number of GPUs available in the system, whether a GPU currently has a context created, whether a handle refers to a MIG instance or a physical GPU, etc. Additionally, we would have simple tools that are generally useful, for example a small tool I wrote long ago to measure NVLink bandwidth and peak memory, and whatever else fits in the scope of a "High-level PyNVML library" that can make our users' lives easier.
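As a rough illustration of two of those extra helpers, the sketch below counts the GPUs in the system and checks whether a given process appears among a device's compute processes. The function names `device_count` and `has_context`, the `nvml` argument, and the `_load_nvml` helper are all hypothetical, and treating "listed as a compute process" as "has a context" is an assumption that may not hold in every setup (for example, when PID namespaces differ inside containers).

```python
import os
from typing import Any, Optional


def _load_nvml() -> Any:
    """Import pynvml lazily and initialize NVML (needs an NVIDIA driver)."""
    import pynvml

    pynvml.nvmlInit()
    return pynvml


def device_count(nvml: Any = None) -> int:
    """Number of GPUs that NVML reports for this system."""
    nvml = nvml if nvml is not None else _load_nvml()
    return nvml.nvmlDeviceGetCount()


def has_context(index: int, pid: Optional[int] = None, nvml: Any = None) -> bool:
    """Whether process `pid` (default: this process) is listed as a
    compute process on the GPU at `index`."""
    nvml = nvml if nvml is not None else _load_nvml()
    pid = os.getpid() if pid is None else pid
    handle = nvml.nvmlDeviceGetHandleByIndex(index)
    # Each entry describes one process currently running compute work.
    processes = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
    return any(proc.pid == pid for proc in processes)
```

Helpers like these could equally live as methods or class methods on `Handle`; they are kept as free functions here only to keep the sketch short.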
So to begin this discussion I would like to know how people like @rjzamora and @kenhester feel about this idea. Would this be something that would fit in the scope of this project? Are there any impediments to adding such a library within the scope of this project/repository?
Also cc @quasiben for vis.