Description
Adapted from public Grafana Slack conversation.
We can get the process PID from the BPF program and send it to userspace, along with other attributes, via a ring buffer.
In userspace, we keep a cache of PIDs and their "relevant" environment variables, such as CUDA_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, etc. We can then look up the environment variables of the received PID to find out the GPU UUID.
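A minimal sketch of such a cache, assuming a Linux host where the process environment can be read from /proc/<pid>/environ. The names `RELEVANT_VARS`, `parse_environ` and `EnvCache` are hypothetical and only illustrate the idea:

```python
import os
from typing import Dict, Iterable

# Assumption: these are the only variables we care about; extend as needed.
RELEVANT_VARS = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES")


def parse_environ(raw: bytes, wanted: Iterable[str] = RELEVANT_VARS) -> Dict[str, str]:
    """Extract the wanted variables from NUL-separated KEY=VALUE entries."""
    wanted_set = set(wanted)
    result: Dict[str, str] = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if not sep:
            continue  # empty or malformed entry
        name = key.decode("utf-8", "replace")
        if name in wanted_set:
            result[name] = value.decode("utf-8", "replace")
    return result


class EnvCache:
    """Caches the relevant environment variables per PID."""

    def __init__(self, proc_root: str = "/proc") -> None:
        self._proc_root = proc_root
        self._cache: Dict[int, Dict[str, str]] = {}

    def lookup(self, pid: int) -> Dict[str, str]:
        if pid not in self._cache:
            path = os.path.join(self._proc_root, str(pid), "environ")
            try:
                with open(path, "rb") as f:
                    raw = f.read()
            except OSError:
                # Process already exited or we lack permission: do not cache.
                return {}
            self._cache[pid] = parse_environ(raw)
        return self._cache[pid]

    def evict(self, pid: int) -> None:
        """Drop a PID, e.g. on process exit, so a reused PID is re-read."""
        self._cache.pop(pid, None)
```

Note that /proc/<pid>/environ reflects the environment the process was started with, so variables changed later at runtime would not be seen by this approach.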
The tricky part is mapping the device ordinals in CUDA_VISIBLE_DEVICES to GPU UUIDs. Imagine a node has 4 GPUs (0, 1, 2, 3) and GPUs 2 and 3 are bound to a given pod. Inside the pod, the env var CUDA_VISIBLE_DEVICES will be set to 0,1 and not 2,3, because the ordinals are relative to the devices visible inside the pod rather than to the node.
That means we first need to build, from the pod resources API, a mapping of all pods to the GPUs bound to them. Then we use the ordinals in CUDA_VISIBLE_DEVICES to pick the correct GPU UUID from the list of GPUs bound to that pod.
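The ordinal-to-UUID step could look roughly like the sketch below. It assumes the list of UUIDs bound to the pod has already been fetched from the pod resources API, and that its order matches the device order seen inside the pod; that ordering is an assumption that would need to be verified. `resolve_visible_gpus` is a hypothetical helper, and the UUIDs are made-up placeholders:

```python
from typing import List, Optional


def resolve_visible_gpus(pod_gpu_uuids: List[str],
                         cuda_visible_devices: Optional[str]) -> List[str]:
    """Map CUDA_VISIBLE_DEVICES entries to the UUIDs of the pod's GPUs.

    pod_gpu_uuids: UUIDs bound to the pod, assumed to be in in-pod device order.
    cuda_visible_devices: raw env var value, or None if the variable is unset.
    """
    if cuda_visible_devices is None:
        # Variable not set: every GPU bound to the pod is visible.
        return list(pod_gpu_uuids)

    resolved: List[str] = []
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if token.startswith(("GPU-", "MIG-")):
            # Entry is already a UUID, so pass it through unchanged.
            resolved.append(token)
            continue
        if not token.isdecimal() or int(token) >= len(pod_gpu_uuids):
            # CUDA ignores everything from the first invalid entry onward.
            break
        resolved.append(pod_gpu_uuids[int(token)])
    return resolved


# Node has 4 GPUs; GPUs 2 and 3 are bound to the pod (placeholder UUIDs).
node_gpus = ["GPU-aaaa", "GPU-bbbb", "GPU-cccc", "GPU-dddd"]
pod_gpus = node_gpus[2:4]

# Inside the pod the ordinals are 0,1 but they refer to node GPUs 2 and 3.
print(resolve_visible_gpus(pod_gpus, "0,1"))  # ['GPU-cccc', 'GPU-dddd']
print(resolve_visible_gpus(pod_gpus, "1"))    # ['GPU-dddd']
```

This sketch does not handle abbreviated UUIDs, which CUDA also accepts in CUDA_VISIBLE_DEVICES.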