Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
My host has only one A100 card. When the GPU Operator was installed (with mig.strategy=single), MIG was enabled and all Pods were running normally. After applying a custom MIG configuration and setting the corresponding label on the node, the nvidia-cuda-validator and nvidia-operator-validator Pods remain stuck in the Init state.
To Reproduce
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.4 \
  --set mig.strategy=single
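As a sanity check (a sketch, assuming the default ClusterPolicy name cluster-policy used later in this report), the applied strategy can be read back with:
kubectl get clusterpolicies.nvidia.com/cluster-policy -o jsonpath='{.spec.mig.strategy}'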
-------
All the Pods are functioning properly
kn get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tn58t 1/1 Running 0 18s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj 1/1 Running 1 (96m ago) 111m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh 1/1 Running 1 (96m ago) 111m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v 1/1 Running 1 (96m ago) 111m
gpu-operator-75dff77d5c-4cctc 1/1 Running 1 (96m ago) 111m
nvidia-container-toolkit-daemonset-cqpcm 1/1 Running 0 95s
nvidia-cuda-validator-7nxmf 0/1 Completed 0 10s
nvidia-dcgm-exporter-kt58n 1/1 Running 0 18s
nvidia-device-plugin-daemonset-njhr8 1/1 Running 0 18s
nvidia-driver-daemonset-7hl9h 1/1 Running 0 2m7s
nvidia-mig-manager-4f4z5 1/1 Running 0 95s
nvidia-operator-validator-jtl8t 1/1 Running 0 95s
-------
Apply the custom MIG configuration
cat custom-mig-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      two-1g-one-2g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2
            "2g.20gb": 1
k apply -f custom-mig-config.yaml
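To double-check what the mig-manager will read (a sketch), the ConfigMap can be dumped back out; the profile names under mig-configs have to match the value used later in the nvidia.com/mig.config node label:
kubectl -n gpu-operator get configmap custom-mig-config -o yaml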
--------------
Point the ClusterPolicy at the custom configuration
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
--type='json' \
-p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'
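To verify the patch took effect (a sketch), the referenced config name can be read back and should now be custom-mig-config:
kubectl get clusterpolicies.nvidia.com/cluster-policy -o jsonpath='{.spec.migManager.config.name}'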
--------------
All the Pods are functioning properly
kn get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tn58t 1/1 Running 0 3m4s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj 1/1 Running 1 (99m ago) 114m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh 1/1 Running 1 (99m ago) 114m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v 1/1 Running 1 (99m ago) 114m
gpu-operator-75dff77d5c-4cctc 1/1 Running 1 (99m ago) 114m
nvidia-container-toolkit-daemonset-cqpcm 1/1 Running 0 4m21s
nvidia-cuda-validator-7nxmf 0/1 Completed 0 2m56s
nvidia-dcgm-exporter-kt58n 1/1 Running 0 3m4s
nvidia-device-plugin-daemonset-njhr8 1/1 Running 0 3m4s
nvidia-driver-daemonset-7hl9h 1/1 Running 0 4m53s
nvidia-mig-manager-ztdbj 1/1 Running 0 13s # restart
nvidia-operator-validator-jtl8t 1/1 Running 0 4m21s
--------------
Label the node
k label nodes n1 nvidia.com/mig.config=two-1g-one-2g --overwrite
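After relabeling, the mig-manager reports progress through the nvidia.com/mig.config.state node label (it should eventually read success; a failed value means mig-parted could not apply the requested layout). A quick way to watch it, sketched here:
kubectl describe node n1 | grep nvidia.com/mig.config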
--------------
Now an error occurs: the cuda-validator goes into Init:CrashLoopBackOff and the operator-validator is stuck at Init:2/4
kn get po
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-6rgp2 1/1 Running 0 15s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj 1/1 Running 1 (102m ago) 117m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh 1/1 Running 1 (102m ago) 117m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v 1/1 Running 1 (102m ago) 117m
gpu-operator-75dff77d5c-4cctc 1/1 Running 1 (102m ago) 117m
nvidia-container-toolkit-daemonset-cqpcm 1/1 Running 0 7m7s
nvidia-cuda-validator-h7h8p 0/1 Init:CrashLoopBackOff 1 (12s ago) 13s # error
nvidia-dcgm-exporter-z2jjp 1/1 Running 0 15s
nvidia-device-plugin-daemonset-m9bct 1/1 Running 0 15s
nvidia-driver-daemonset-7hl9h 1/1 Running 0 7m39s
nvidia-mig-manager-ztdbj 1/1 Running 0 2m59s
nvidia-operator-validator-zg6kt 0/1 Init:2/4 0 16s
--------------
kn describe po nvidia-cuda-validator-h7h8p
.....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 38s (x4 over 74s) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.4" already present on machine
Normal Created 38s (x4 over 74s) kubelet Created container: cuda-validation
Normal Started 38s (x4 over 74s) kubelet Started container cuda-validation
Warning BackOff 9s (x7 over 73s) kubelet Back-off restarting failed container cuda-validation in pod nvidia-cuda-validator-h7h8p_gpu-operator(e5d64f72-86c8-4c90-936a-aa59a005abba)
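The events only show the back-off; the actual failure reason should be in the init container log (a sketch, using the container name from the events above):
kubectl -n gpu-operator logs nvidia-cuda-validator-h7h8p -c cuda-validation --previous
kubectl -n gpu-operator describe pod nvidia-operator-validator-zg6kt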
--------------
kn exec -it nvidia-driver-daemonset-7hl9h -- nvidia-smi
Mon Sep 29 09:07:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:02:00.0 Off | On |
| N/A 51C P0 94W / 250W | 0MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Shared Memory-Usage | Vol| Shared |
| ID ID Dev | Shared BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
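nvidia-smi reports MIG mode Enabled but no GPU instances, so the requested two-1g-one-2g layout does not appear to have been applied. The nvidia-mig-manager log should show whether applying the profile failed (a sketch, pod name taken from the listing above), and the instance layout can also be checked directly from the driver container:
kubectl -n gpu-operator logs nvidia-mig-manager-ztdbj
kubectl -n gpu-operator exec -it nvidia-driver-daemonset-7hl9h -- nvidia-smi mig -lgi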
--------------
Everything is fine in single mode
k describe node n1 | grep Capacity -A 8
Capacity:
cpu: 20
ephemeral-storage: 1966788624Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65231232Ki
nvidia.com/gpu: 4
nvidia.com/mig-1g.10gb: 4
pods: 110
Expected behavior
All GPU Operator Pods run normally after the custom MIG configuration is applied.
Environment (please provide the following information):
- GPU Operator Version: v25.3.4
- OS: Ubuntu 24.04
- Kernel Version: 6.8.0-84-generic
- Container Runtime Version: containerd v2.1.4
- Kubernetes Distro and Version: K8s v1.34.0