
When MIG uses the "mixed" mode, the nvidia-cuda-validator and nvidia-operator-validator Pods are always in the Init state #1738

@biqiangwu

Description


Describe the bug
My host has only one A100 card. I installed the GPU Operator with MIG mixed mode enabled, and at that point all Pods were healthy. After applying a custom MIG configuration and setting the corresponding label on the node, the nvidia-cuda-validator and nvidia-operator-validator Pods stay stuck in the Init state.

To Reproduce

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.3.4 \
    --set mig.strategy=single
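
As an extra sanity check (not part of the original steps), the MIG strategy recorded in the ClusterPolicy can be read back; this assumes the default resource name cluster-policy used in the patch command further down:

kubectl get clusterpolicies.nvidia.com/cluster-policy \
    -o jsonpath='{.spec.mig.strategy}'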

-------
All the Pods are functioning properly

kn get po 
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tn58t                                       1/1     Running     0             18s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running     1 (96m ago)   111m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running     1 (96m ago)   111m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running     1 (96m ago)   111m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running     1 (96m ago)   111m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running     0             95s
nvidia-cuda-validator-7nxmf                                       0/1     Completed   0             10s
nvidia-dcgm-exporter-kt58n                                        1/1     Running     0             18s
nvidia-device-plugin-daemonset-njhr8                              1/1     Running     0             18s
nvidia-driver-daemonset-7hl9h                                     1/1     Running     0             2m7s
nvidia-mig-manager-4f4z5                                          1/1     Running     0             95s
nvidia-operator-validator-jtl8t                                   1/1     Running     0             95s

-------
Create the custom MIG configuration ConfigMap

cat custom-mig-config.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-disabled:
        - devices: all
          mig-enabled: false
      
      two-1g-one-2g:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.10gb": 2
            "2g.20gb": 1

k apply -f custom-mig-config.yaml
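
The ConfigMap can be checked after applying (added here for completeness):

kn get configmap custom-mig-config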

--------------
Point the ClusterPolicy at the custom configuration

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --type='json' \
    -p='[{"op":"replace", "path":"/spec/migManager/config/name", "value":"custom-mig-config"}]'
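
To confirm the patch landed, the config name can be read back (an added check, not in the original steps):

kubectl get clusterpolicies.nvidia.com/cluster-policy \
    -o jsonpath='{.spec.migManager.config.name}'
# expected output: custom-mig-config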

--------------
All the Pods are functioning properly

kn get po 
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tn58t                                       1/1     Running     0             3m4s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running     1 (99m ago)   114m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running     1 (99m ago)   114m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running     1 (99m ago)   114m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running     1 (99m ago)   114m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running     0             4m21s
nvidia-cuda-validator-7nxmf                                       0/1     Completed   0             2m56s
nvidia-dcgm-exporter-kt58n                                        1/1     Running     0             3m4s
nvidia-device-plugin-daemonset-njhr8                              1/1     Running     0             3m4s
nvidia-driver-daemonset-7hl9h                                     1/1     Running     0             4m53s
nvidia-mig-manager-ztdbj                                          1/1     Running     0             13s    # restart
nvidia-operator-validator-jtl8t                                   1/1     Running     0             4m21s

--------------
Label the node with the desired MIG profile

k label nodes n1 nvidia.com/mig.config=two-1g-one-2g --overwrite
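
The mig-manager reports progress through the nvidia.com/mig.config.state node label (e.g. pending, success, failed); checking it after labeling shows whether the reconfiguration actually completed. This is an added check, assuming the default mig-manager labeling:

kubectl get node n1 -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'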

--------------
The validator Pods now fail

kn get po 
NAME                                                              READY   STATUS                  RESTARTS       AGE
gpu-feature-discovery-6rgp2                                       1/1     Running                 0              15s
gpu-operator-1759128897-node-feature-discovery-gc-599fbd57lztkj   1/1     Running                 1 (102m ago)   117m
gpu-operator-1759128897-node-feature-discovery-master-596bklwfh   1/1     Running                 1 (102m ago)   117m
gpu-operator-1759128897-node-feature-discovery-worker-4cj2v       1/1     Running                 1 (102m ago)   117m
gpu-operator-75dff77d5c-4cctc                                     1/1     Running                 1 (102m ago)   117m
nvidia-container-toolkit-daemonset-cqpcm                          1/1     Running                 0              7m7s
nvidia-cuda-validator-h7h8p                                       0/1     Init:CrashLoopBackOff   1 (12s ago)    13s    # error
nvidia-dcgm-exporter-z2jjp                                        1/1     Running                 0              15s
nvidia-device-plugin-daemonset-m9bct                              1/1     Running                 0              15s
nvidia-driver-daemonset-7hl9h                                     1/1     Running                 0              7m39s
nvidia-mig-manager-ztdbj                                          1/1     Running                 0              2m59s
nvidia-operator-validator-zg6kt                                   0/1     Init:2/4                0              16s
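
To see which init containers the operator validator is stuck on (an added check, not in the original steps), the init container statuses can be listed:

kn get pod nvidia-operator-validator-zg6kt \
    -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'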

--------------
kn describe po nvidia-cuda-validator-h7h8p
.....
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Normal   Pulled   38s (x4 over 74s)  kubelet  Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v25.3.4" already present on machine
  Normal   Created  38s (x4 over 74s)  kubelet  Created container: cuda-validation
  Normal   Started  38s (x4 over 74s)  kubelet  Started container cuda-validation
  Warning  BackOff  9s (x7 over 73s)   kubelet  Back-off restarting failed container cuda-validation in pod nvidia-cuda-validator-h7h8p_gpu-operator(e5d64f72-86c8-4c90-936a-aa59a005abba)
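
The events only show the back-off; the actual failure reason should be in the init container's own logs (container name cuda-validation taken from the events above):

kn logs nvidia-cuda-validator-h7h8p -c cuda-validation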

--------------
kn exec -it nvidia-driver-daemonset-7hl9h -- nvidia-smi 
Mon Sep 29 09:07:14 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:02:00.0 Off |                   On |
| N/A   51C    P0             94W /  250W |       0MiB /  40960MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+----------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |              Shared Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                Shared BAR1-Usage | SM     Unc| CE ENC  DEC  OFA  JPG |
|                  |                                  |        ECC|                       |
|==================+==================================+===========+=======================|
|  No MIG devices found                                                                   |
+-----------------------------------------------------------------------------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
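
MIG mode is enabled but no MIG devices were created, which would explain the validator failures. As an added check, the driver container can list GPU instances directly:

kn exec -it nvidia-driver-daemonset-7hl9h -- nvidia-smi mig -lgi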


--------------
For comparison, everything works fine in single mode

k describe node n1 | grep Capacity -A 8
Capacity:
  cpu:                     20
  ephemeral-storage:       1966788624Ki
  hugepages-1Gi:           0
  hugepages-2Mi:           0
  memory:                  65231232Ki
  nvidia.com/gpu:          4
  nvidia.com/mig-1g.10gb:  4
  pods:                    110
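
If the two-1g-one-2g profile were applied under the mixed strategy, the node would be expected to advertise nvidia.com/mig-1g.10gb: 2 and nvidia.com/mig-2g.20gb: 1 instead. A quick way to check the advertised resources (added here for reference):

k describe node n1 | grep -E 'nvidia.com/(gpu|mig)'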


Expected behavior
The GPU Operator Pods continue to run normally after the custom MIG configuration is applied.

Environment:

  • GPU Operator Version: v25.3.4
  • OS: Ubuntu 24.04
  • Kernel Version: 6.8.0-84-generic
  • Container Runtime Version: containerd v2.1.4
  • Kubernetes Distro and Version: K8s v1.34.0
