
GPU allocation leak using NVIDIA_VISIBLE_DEVICES #1693

@ltm920716

Description


Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
I apply a pod manifest that requests GPUs and also sets the env var NVIDIA_VISIBLE_DEVICES explicitly, e.g. NVIDIA_VISIBLE_DEVICES=3,1,0,2 or NVIDIA_VISIBLE_DEVICES=4,5,6,7. Sometimes the pod then ends up with all 8 GPUs visible instead of the requested 4.

(screenshot attached)
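
One way to verify what the pod actually sees (a sketch; POD_NAME is a placeholder for a pod created from the spec below, and nvidia-smi is assumed to be available in the container):

# Value of the env var as the container sees it
kubectl exec POD_NAME -- env | grep NVIDIA_VISIBLE_DEVICES
# GPUs actually visible inside the container; 4 lines are expected, the bug is that 8 appear
kubectl exec POD_NAME -- nvidia-smi -L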

To Reproduce

...
resources:
  limits:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
  requests:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
securityContext:
  capabilities:
    add:
    - IPC_LOCK
  privileged: false
...
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "0,1,2,3"
...
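
For completeness, the fragments above assemble into roughly the following pod spec (a sketch: the pod name, container name, image, and command are placeholders; the resources, env, and securityContext are copied from the snippets above):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-leak-repro                                    # placeholder name
spec:
  containers:
  - name: cuda                                            # placeholder container name
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04    # placeholder CUDA image
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1,2,3"
    resources:
      requests:
        nvidia.com/gpu: "4"
        rdma/15b3_1021_0: "1"
        rdma/15b3_1021_1: "1"
        rdma/15b3_1021_2: "1"
        rdma/15b3_1021_3: "1"
      limits:
        nvidia.com/gpu: "4"
        rdma/15b3_1021_0: "1"
        rdma/15b3_1021_1: "1"
        rdma/15b3_1021_2: "1"
        rdma/15b3_1021_3: "1"
    securityContext:
      privileged: false
      capabilities:
        add:
        - IPC_LOCK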

Expected behavior
GPUs are allocated normally: the pod should only see the 4 GPUs requested via nvidia.com/gpu, not all 8.

Environment (please provide the following information):

  • GPU Operator Version: [e.g. v25.3.0]
  • OS: [e.g. Ubuntu24.04]
  • Kernel Version: [e.g. 6.8.0-generic]
  • Container Runtime Version: [e.g. containerd 2.0.0]
  • Kubernetes Distro and Version: [e.g. K8s, OpenShift, Rancher, GKE, EKS]
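
These details were not filled in above; the following commands collect them (a sketch, assuming a Helm install into the default gpu-operator namespace; node-level commands run on the GPU node itself):

helm list -n gpu-operator    # GPU Operator chart version (if installed via Helm)
kubectl version              # Kubernetes version
cat /etc/os-release          # OS (on the node)
uname -r                     # kernel version (on the node)
containerd --version         # container runtime version (on the node)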

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
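
If it helps, the items above can be collected in one pass (a sketch; the namespace and the driver-pod label selector are assumptions and may differ per install):

OPERATOR_NAMESPACE=gpu-operator    # assumption: default namespace used by the operator
kubectl get pods -n "$OPERATOR_NAMESPACE"
kubectl get ds -n "$OPERATOR_NAMESPACE"
# assumption: driver daemonset pods carry the label app=nvidia-driver-daemonset
DRIVER_POD_NAME=$(kubectl get pods -n "$OPERATOR_NAMESPACE" -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$DRIVER_POD_NAME" -n "$OPERATOR_NAMESPACE" -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log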

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

must-gather.zip

This bundle can be submitted to us via email: [email protected]
