
NVIDIA devices not visible under /dev on AKS GPU nodes with GPU Operator (Pre-compiled driver image) #1741

@schdhry

Description

Environment
• Cluster: AKS 1.31.10
• OS Image: Ubuntu 22.04 (kernel 5.15.0-1092-azure)
• GPU Operator version: v25.3.3 (Helm deployment)
• NVIDIA driver: precompiled driver image (nvidia-driver-570-5.15.0-1092-azure-ubuntu22.04)
• NVIDIA container toolkit: installed by GPU Operator (/usr/local/nvidia/toolkit)

Steps to Reproduce
1. Deploy an AKS cluster with a GPU node pool (Ubuntu 22.04, kernel 5.15.0-1092-azure).
2. Install the NVIDIA GPU Operator v25.3.3 via Helm (an approximate install command is sketched after this list).
3. Confirm the driver DaemonSet installs the precompiled driver package.
4. Deploy a GPU workload (e.g., Triton Inference Server with resources.limits["nvidia.com/gpu"]=1).
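
For reference, the install was roughly the following; the chart repo alias and the exact --set flags are reconstructed from memory and may differ slightly from what was actually run:

helm install gpu-operator nvidia/gpu-operator \
  --version v25.3.3 \
  -n gpu-operator --create-namespace \
  --set driver.usePrecompiled=true \
  --set driver.version="570"

The Triton pod requests a GPU in the usual way, approximately:

resources:
  limits:
    nvidia.com/gpu: 1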

Actual Behavior
• Device nodes only appear under:

/run/nvidia/driver/dev/nvidia0
/run/nvidia/driver/dev/nvidiactl
/run/nvidia/driver/dev/nvidia-uvm
/run/nvidia/driver/dev/nvidia-modeset

• No /dev/nvidia* entries are created on the host.
• Pod creation fails with:

failed to generate spec: lstat /dev/nvidiactl: no such file or directory
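
A minimal sketch of how the device-node locations were inspected (via a kubectl debug node session; the node name and image are placeholders):

kubectl debug node/<gpu-node> -it --image=ubuntu -- chroot /host
# then, inside the chroot:
ls -l /dev/nvidia*                  # no matches on the host
ls -l /run/nvidia/driver/dev/       # nvidia0, nvidiactl, nvidia-uvm, nvidia-modeset all present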

Expected Behavior
• The NVIDIA container runtime should detect the device nodes under /dev and allow GPU workloads to run.

Debugging Performed
• ✅ Precompiled driver 570.158.01 installed by DaemonSet without error.
• ✅ lsmod | grep nvidia confirms modules loaded.
• ✅ /run/nvidia/driver/dev/ has device nodes.
• ❌ /dev/nvidia* missing on host.
• ✅ NVIDIA container toolkit installed under /usr/local/nvidia/toolkit.
• BinaryName → /usr/local/toolkit/nvidia-container-runtime (runtime/toolkit checks are sketched after this list).
• ❌ which nvidia-container-runtime → not found.
• ✅ Toolkit DaemonSet logs confirm the runtime binaries and config were installed.
• ✅ Tried adding NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=all to the device plugin env (see the ClusterPolicy snippet below).
• ❌ Triton pod still fails (runtime error occurs before container start).
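
The toolkit/runtime checks above were roughly the following; the containerd config path is an assumption (AKS may place the operator-managed config elsewhere or import it as a drop-in):

ls /usr/local/nvidia/toolkit/                      # runtime binaries installed by the toolkit DaemonSet
grep -n 'BinaryName' /etc/containerd/config.toml   # nvidia runtime entry written by the toolkit container
which nvidia-container-runtime                     # not found: the toolkit directory is not on PATH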

ClusterPolicy Snippet

spec:
  hostPaths:
    driverInstallDir: /run/nvidia/driver
    rootFS: /
  devicePlugin:
    enabled: true
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all

Questions
• On AKS, should /dev/nvidia* device nodes be created directly on the host, or is it expected that they remain only under /run/nvidia/driver/dev/?
• Is GPU Operator supposed to configure NVIDIA container runtime differently when using precompiled driver images?
