Support getting the GPU device id for GPU metrics #2014

@grcevski

Description

Adapted from a public Grafana Slack conversation.

We can get the process PID in the BPF program and send it to userspace, along with the other attributes, via the ring buffer.
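As a rough sketch of the userspace side, the record read from the ring buffer could be decoded like this. The `gpuEvent` layout, its field names, and the little-endian byte order are assumptions made for illustration; the real event struct would have to mirror whatever the BPF program actually writes.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// gpuEvent is a hypothetical fixed-size record emitted by the BPF program.
// Only the PID matters for this issue; Attr stands in for "other attributes".
type gpuEvent struct {
	PID  uint32 // process ID as seen by the kernel
	Pad  uint32 // explicit padding so Attr is 8-byte aligned, as in the C struct
	Attr uint64 // placeholder for whatever else the probe reports
}

// decodeEvent parses one raw ring buffer record into a gpuEvent.
// It assumes a little-endian host, which holds for x86-64 and most arm64 systems.
func decodeEvent(raw []byte) (gpuEvent, error) {
	var ev gpuEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

func main() {
	// A hand-built record: PID 1234 (0x04D2), padding, Attr 16.
	raw := []byte{0xD2, 0x04, 0, 0, 0, 0, 0, 0, 0x10, 0, 0, 0, 0, 0, 0, 0}
	ev, err := decodeEvent(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(ev.PID, ev.Attr)
}
```

The decoded `PID` is then the key for the environment variable lookup described next.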

In userspace, we keep a cache of PIDs and their "relevant" environment variables, such as CUDA_VISIBLE_DEVICES, NVIDIA_VISIBLE_DEVICES, etc. We can then look up the environment variables for the received PID to find the GPU UUID.
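A minimal sketch of such a cache, assuming the agent can read `/proc/<pid>/environ` for the PID reported by the BPF program (for example, because it runs in the host PID namespace). The `envCache` type and the helper names are hypothetical, not existing code.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"sync"
)

// relevantVars lists the environment variables worth caching per PID.
var relevantVars = map[string]bool{
	"CUDA_VISIBLE_DEVICES":   true,
	"NVIDIA_VISIBLE_DEVICES": true,
}

// parseEnviron extracts the relevant variables from the NUL-separated
// contents of a /proc/<pid>/environ file.
func parseEnviron(raw []byte) map[string]string {
	out := map[string]string{}
	for _, entry := range bytes.Split(raw, []byte{0}) {
		key, value, found := bytes.Cut(entry, []byte("="))
		if found && relevantVars[string(key)] {
			out[string(key)] = string(value)
		}
	}
	return out
}

// envCache is a hypothetical per-PID cache of the relevant variables.
type envCache struct {
	mu    sync.Mutex
	byPID map[uint32]map[string]string
}

func newEnvCache() *envCache {
	return &envCache{byPID: map[uint32]map[string]string{}}
}

// lookup returns the cached variables for pid, reading procfs on a miss.
func (c *envCache) lookup(pid uint32) (map[string]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if env, ok := c.byPID[pid]; ok {
		return env, nil
	}
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/environ", pid))
	if err != nil {
		return nil, err
	}
	env := parseEnviron(raw)
	c.byPID[pid] = env
	return env, nil
}

func main() {
	cache := newEnvCache()
	env, err := cache.lookup(uint32(os.Getpid()))
	if err != nil {
		fmt.Println("could not read environ:", err)
		return
	}
	fmt.Printf("CUDA_VISIBLE_DEVICES=%q\n", env["CUDA_VISIBLE_DEVICES"])
}
```

Two caveats with this approach: `/proc/<pid>/environ` reflects the environment the process was started with, so variables set later from inside the process are not visible, and entries would need to be evicted on process exit so that a reused PID does not return stale values.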

The tricky part is mapping the device ordinals in CUDA_VISIBLE_DEVICES to GPU UUIDs. Imagine there are 4 GPUs (0, 1, 2, 3) on a node, and GPUs 2 and 3 are bound to a given pod. Inside the pod, the env var CUDA_VISIBLE_DEVICES will be set to 0,1 and not 2,3, because the ordinals are relative to the devices visible to the pod rather than to the node.

That means we first need to build, from the pod resources API, a mapping of all pods to the GPUs bound to them. Then, based on the ordinals in CUDA_VISIBLE_DEVICES, we can index into the pod's GPU list to get the correct GPU UUID.
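The ordinal-to-UUID step could look roughly like the sketch below. It assumes `podGPUs` already holds the UUIDs of the GPUs bound to the pod (for instance, gathered from the pod resources API) in the same order the container enumerates them; that ordering is an assumption that would need to be verified. The function name and the UUID strings are made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveVisibleGPUs maps the entries of a CUDA_VISIBLE_DEVICES value to GPU UUIDs.
// podGPUs is assumed to list the UUIDs bound to the pod in container-local order,
// so that ordinal 0 inside the pod corresponds to podGPUs[0].
func resolveVisibleGPUs(visible string, podGPUs []string) ([]string, error) {
	if visible == "" {
		// An empty value is treated here as "no devices visible".
		// An unset variable is a different case and is left to the caller.
		return nil, nil
	}
	var out []string
	for _, entry := range strings.Split(visible, ",") {
		entry = strings.TrimSpace(entry)
		if strings.HasPrefix(entry, "GPU-") {
			// The variable may also carry UUIDs directly; pass them through.
			out = append(out, entry)
			continue
		}
		idx, err := strconv.Atoi(entry)
		if err != nil || idx < 0 || idx >= len(podGPUs) {
			return nil, fmt.Errorf("invalid device entry %q for %d bound GPUs", entry, len(podGPUs))
		}
		out = append(out, podGPUs[idx])
	}
	return out, nil
}

func main() {
	// Node GPUs 2 and 3 are bound to the pod; the UUIDs are placeholders.
	podGPUs := []string{"GPU-node-2", "GPU-node-3"}
	uuids, err := resolveVisibleGPUs("0,1", podGPUs)
	fmt.Println(uuids, err) // prints "[GPU-node-2 GPU-node-3] <nil>"
}
```

With the 4-GPU example above, ordinal 0 inside the pod resolves to the UUID of node GPU 2 and ordinal 1 to node GPU 3.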


Labels: P3 (Priority 3)
