
GPU allocation leak using NVIDIA_VISIBLE_DEVICES #1693

@ltm920716

Description


Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug
I apply a pod manifest that requests GPUs and also sets the env var NVIDIA_VISIBLE_DEVICES explicitly, e.g. NVIDIA_VISIBLE_DEVICES=3,1,0,2 or NVIDIA_VISIBLE_DEVICES=4,5,6,7. Sometimes the pod then ends up with all 8 GPUs visible instead of the requested 4.

(screenshot attached)
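
One way to verify what the pod actually sees (a sketch; POD_NAME is a placeholder for a pod created from the spec below, and nvidia-smi is assumed to be available in the container):

# Value of the env var as the container sees it
kubectl exec POD_NAME -- env | grep NVIDIA_VISIBLE_DEVICES
# GPUs actually visible inside the container; 4 lines are expected, the bug is that 8 appear
kubectl exec POD_NAME -- nvidia-smi -L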

To Reproduce

...
resources:
  limits:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
  requests:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
securityContext:
  capabilities:
    add:
    - IPC_LOCK
  privileged: false
...
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "0,1,2,3"
...
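
For completeness, the fragments above assemble into roughly the following pod spec (a sketch: the pod name, container name, image, and command are placeholders; the resources, env, and securityContext are copied from the snippets above):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-leak-repro                                    # placeholder name
spec:
  containers:
  - name: cuda                                            # placeholder container name
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04    # placeholder CUDA image
    command: ["sleep", "infinity"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: "0,1,2,3"
    resources:
      requests:
        nvidia.com/gpu: "4"
        rdma/15b3_1021_0: "1"
        rdma/15b3_1021_1: "1"
        rdma/15b3_1021_2: "1"
        rdma/15b3_1021_3: "1"
      limits:
        nvidia.com/gpu: "4"
        rdma/15b3_1021_0: "1"
        rdma/15b3_1021_1: "1"
        rdma/15b3_1021_2: "1"
        rdma/15b3_1021_3: "1"
    securityContext:
      privileged: false
      capabilities:
        add:
        - IPC_LOCK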

Expected behavior
GPUs are allocated normally: the pod should only see the 4 GPUs requested via nvidia.com/gpu, not all 8.

Environment (please provide the following information):

  • GPU Operator Version: [e.g. v25.3.0]
  • OS: [e.g. Ubuntu24.04]
  • Kernel Version: [e.g. 6.8.0-generic]
  • Container Runtime Version: [e.g. containerd 2.0.0]
  • Kubernetes Distro and Version: [e.g. K8s, OpenShift, Rancher, GKE, EKS]
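
These details were not filled in above; the following commands collect them (a sketch, assuming a Helm install into the default gpu-operator namespace; node-level commands run on the GPU node itself):

helm list -n gpu-operator    # GPU Operator chart version (if installed via Helm)
kubectl version              # Kubernetes version
cat /etc/os-release          # OS (on the node)
uname -r                     # kernel version (on the node)
containerd --version         # container runtime version (on the node)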

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
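
If it helps, the items above can be collected in one pass (a sketch; the namespace and the driver-pod label selector are assumptions and may differ per install):

OPERATOR_NAMESPACE=gpu-operator    # assumption: default namespace used by the operator
kubectl get pods -n "$OPERATOR_NAMESPACE"
kubectl get ds -n "$OPERATOR_NAMESPACE"
# assumption: driver daemonset pods carry the label app=nvidia-driver-daemonset
DRIVER_POD_NAME=$(kubectl get pods -n "$OPERATOR_NAMESPACE" -l app=nvidia-driver-daemonset -o jsonpath='{.items[0].metadata.name}')
kubectl exec "$DRIVER_POD_NAME" -n "$OPERATOR_NAMESPACE" -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log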

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

must-gather.zip

This bundle can be submitted to us via email: [email protected]
