Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
I create a pod that sets the NVIDIA_VISIBLE_DEVICES environment variable explicitly, e.g. NVIDIA_VISIBLE_DEVICES=3,1,0,2 or NVIDIA_VISIBLE_DEVICES=4,5,6,7. Sometimes the pod then gets 8 GPUs.

To Reproduce
...
resources:
  limits:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
  requests:
    nvidia.com/gpu: "4"
    rdma/15b3_1021_0: "1"
    rdma/15b3_1021_1: "1"
    rdma/15b3_1021_2: "1"
    rdma/15b3_1021_3: "1"
securityContext:
  capabilities:
    add:
    - IPC_LOCK
  privileged: false
...
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: 0,1,2,3
...
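A rough way to confirm the mismatch on a running pod is sketched below. It compares the number of entries in NVIDIA_VISIBLE_DEVICES with the number of GPUs that nvidia-smi -L lists inside the container. The function names count_ids and check_pod_gpus are hypothetical helpers, not part of the GPU Operator, and the sketch assumes the container image ships printenv and nvidia-smi and that the variable holds a comma-separated list (not a value such as all).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the GPU Operator.

# Count the entries in a comma-separated device list such as "3,1,0,2".
count_ids() {
  local -a ids
  IFS=',' read -r -a ids <<< "$1"
  echo "${#ids[@]}"
}

# Compare the device list set in the pod's environment with the GPUs
# that nvidia-smi actually reports inside the container.
# Usage: check_pod_gpus POD_NAME NAMESPACE
check_pod_gpus() {
  local pod="$1" ns="$2"
  local wanted visible
  # Assumes printenv and nvidia-smi are available in the container image.
  wanted="$(kubectl exec "$pod" -n "$ns" -- printenv NVIDIA_VISIBLE_DEVICES)"
  visible="$(kubectl exec "$pod" -n "$ns" -- nvidia-smi -L | grep -c '^GPU ')"
  echo "NVIDIA_VISIBLE_DEVICES=$wanted ($(count_ids "$wanted") entries), nvidia-smi lists $visible GPUs"
  # Non-zero exit status when the two counts differ.
  [ "$(count_ids "$wanted")" -eq "$visible" ]
}
```

After sourcing the script, check_pod_gpus POD_NAME NAMESPACE prints both counts and returns a non-zero status when they differ, which would be the case when a pod with four listed devices sees 8 GPUs.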
Expected behavior
The pod is allocated normally, i.e. it only gets the GPUs it asked for.
Environment (please provide the following information):
- GPU Operator Version: [e.g. v25.3.0]
- OS: [e.g. Ubuntu24.04]
- Kernel Version: [e.g. 6.8.0-generic]
- Container Runtime Version: [e.g. containerd 2.0.0]
- Kubernetes Distro and Version: [e.g. K8s, OpenShift, Rancher, GKE, EKS]
Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs
journalctl -u containerd > containerd.log
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]