Support getting the GPU device id for GPU metrics

Adapted from a public Grafana Slack conversation.

We can get the process PID in the BPF program and send it to userspace, along with the other attributes, via the ring buffer.

In userspace, we keep a cache of PIDs and their "relevant" environment variables, such as CUDA_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, etc. We can then look up the environment variables of the received PID to find the GPU UUID.
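A minimal sketch of such a cache, assuming the environment is read from `/proc/<pid>/environ` (which reflects the environment the process was started with). The names `RELEVANT_VARS`, `parse_environ`, and `PidEnvCache` are hypothetical and only illustrate the idea; they are not part of any existing code base.

```python
from pathlib import Path

# Hypothetical list of variables worth caching; extend as needed.
RELEVANT_VARS = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES")


def parse_environ(raw: bytes) -> dict:
    """Extract the relevant variables from a NUL-separated environ blob."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        name = key.decode(errors="replace")
        if sep and name in RELEVANT_VARS:
            env[name] = value.decode(errors="replace")
    return env


class PidEnvCache:
    """Caches the relevant environment variables per PID (illustrative only)."""

    def __init__(self, proc_root="/proc"):
        self._proc_root = Path(proc_root)
        self._cache = {}

    def lookup(self, pid: int) -> dict:
        if pid not in self._cache:
            try:
                raw = (self._proc_root / str(pid) / "environ").read_bytes()
            except OSError:
                # Process already exited or is not readable: nothing to cache.
                return {}
            self._cache[pid] = parse_environ(raw)
        return self._cache[pid]

    def forget(self, pid: int) -> None:
        """Drop an entry, e.g. on process exit, so a reused PID is re-read."""
        self._cache.pop(pid, None)
```

On each event received from the ring buffer, userspace would call `lookup(pid)`. Entries need to be evicted when a process exits, since PIDs can be reused.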

The tricky part is mapping the device ordinals in CUDA_VISIBLE_DEVICES to GPU UUIDs. Imagine there are 4 GPUs (0,1,2,3) on a node and GPUs 2 and 3 are bound to a given pod. Inside the pod, the env var CUDA_VISIBLE_DEVICES will be set to 0,1 and not 2,3, because the ordinals are relative to the devices visible to the pod rather than to the node.

That means we first need to build, from the pod resources API, a mapping of all pods to the GPUs bound to them, and then use CUDA_VISIBLE_DEVICES to pick the correct GPU UUID from that mapping.
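The resolution step could look roughly like the sketch below. It assumes the per-pod GPU list has already been obtained from the pod resources API as a list of UUIDs, and that its order matches the ordinals seen inside the pod; that ordering is an assumption that would need verifying. The function name `resolve_gpu_uuids` and the UUID values are hypothetical.

```python
def resolve_gpu_uuids(visible_devices: str, assigned_uuids: list) -> list:
    """Map CUDA_VISIBLE_DEVICES entries to the GPU UUIDs bound to a pod.

    assigned_uuids: UUIDs bound to the pod, assumed (not verified) to be in
    the same order as the device ordinals inside the pod.
    """
    resolved = []
    for token in visible_devices.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith(("GPU-", "MIG-")):
            # CUDA_VISIBLE_DEVICES may already contain UUIDs; pass them through.
            resolved.append(token)
        elif token.isdecimal() and int(token) < len(assigned_uuids):
            resolved.append(assigned_uuids[int(token)])
        else:
            # Stop at the first invalid entry; as far as we know, CUDA also
            # ignores everything after an invalid index.
            break
    return resolved


# Example from the text: the node has 4 GPUs, and GPUs 2 and 3 are bound to the pod.
pod_gpus = ["GPU-cccc", "GPU-dddd"]  # hypothetical UUIDs of node GPUs 2 and 3
print(resolve_gpu_uuids("0,1", pod_gpus))
```

With this mapping, ordinal 0 inside the pod resolves to the UUID of node GPU 2 and ordinal 1 to the UUID of node GPU 3.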

Support getting the GPU device id for GPU metrics #2014

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support getting the GPU device id for GPU metrics #2014

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions