Description
Environment
• Cluster: AKS 1.31.10
• OS Image: Ubuntu 22.04 (kernel 5.15.0-1092-azure)
• GPU Operator version: v25.3.3 (Helm deployment)
• NVIDIA driver: precompiled driver image (nvidia-driver-570-5.15.0-1092-azure-ubuntu22.04)
• NVIDIA container toolkit: installed by GPU Operator (/usr/local/nvidia/toolkit)
Steps to Reproduce
1. Deploy an AKS cluster with GPU nodepool (Ubuntu 22.04, kernel 5.15.0-1092-azure).
2. Install the NVIDIA GPU Operator via Helm (v25.3.3); a minimal install sketch follows this list.
3. Confirm the driver DaemonSet installs the precompiled driver package.
4. Deploy a GPU workload (e.g., Triton Inference Server with resources.limits["nvidia.com/gpu"]=1).
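For reference, the install was done roughly as follows (a minimal sketch; the repo URL and flags follow the standard GPU Operator documentation, and driver.version for precompiled images is assumed to be the driver branch, so adjust to match the image actually used):

# Add the NVIDIA Helm repo and install the GPU Operator with precompiled driver images
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v25.3.3 \
  --set driver.usePrecompiled=true \
  --set driver.version=570   # driver branch for precompiled images (assumption)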
Actual Behavior
• Device nodes only appear under:
/run/nvidia/driver/dev/nvidia0
/run/nvidia/driver/dev/nvidiactl
/run/nvidia/driver/dev/nvidia-uvm
/run/nvidia/driver/dev/nvidia-modeset
• No /dev/nvidia* entries are created on the host.
• Pod creation fails:
failed to generate spec: lstat /dev/nvidiactl: no such file or directory
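The split can be confirmed from a shell on the node (a sketch, assuming kubectl debug access; the node name is a placeholder):

# Open a shell on the affected GPU node (node name is a placeholder)
kubectl debug node/aks-gpunp-12345678-vmss000000 -it --image=ubuntu -- chroot /host bash
# Device nodes exist only under the containerized driver root ...
ls -l /run/nvidia/driver/dev/nvidia*
# ... but not on the host /dev
ls -l /dev/nvidia*   # "No such file or directory"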
Expected Behavior
• The NVIDIA container runtime should detect the device nodes under /dev so that GPU workloads can run.
Debugging Performed
• ✅ Precompiled driver 570.158.01 installed by the DaemonSet without error.
• ✅ lsmod | grep nvidia confirms the modules are loaded.
• ✅ /run/nvidia/driver/dev/ contains the device nodes.
• ❌ /dev/nvidia* missing on host.
• ✅ NVIDIA container toolkit installed under /usr/local/nvidia/toolkit.
• BinaryName → /usr/local/toolkit/nvidia-container-runtime.
• ❌ which nvidia-container-runtime → not found.
• ✅ Toolkit DaemonSet logs confirm runtime binaries and config installed.
• ✅ Tried adding NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=all.
• ❌ Triton pod still fails (runtime error occurs before container start).
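The runtime-side checks above were performed roughly as follows (a sketch from the node shell; paths follow the GPU Operator defaults on containerd and may differ on this cluster):

# Kernel modules are loaded
lsmod | grep nvidia
# Toolkit binaries installed by the operator (not on the default PATH, which is why `which` fails)
ls /usr/local/nvidia/toolkit/
# Runtime registration in containerd; BinaryName should point at the toolkit runtime
grep -B2 -A5 nvidia /etc/containerd/config.toml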
ClusterPolicy Snippet
spec:
  hostPaths:
    driverInstallDir: /run/nvidia/driver
    rootFS: /
  devicePlugin:
    enabled: true
    env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
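As an additional sanity check, running nvidia-smi from inside the driver container would confirm whether the driver itself is healthy, which would narrow the problem to device-node visibility outside the driver root (a sketch; the DaemonSet name nvidia-driver-daemonset and namespace gpu-operator are the operator defaults and may differ here):

kubectl -n gpu-operator exec -it ds/nvidia-driver-daemonset -- nvidia-smi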
Questions
• On AKS, should /dev/nvidia* device nodes be created directly on the host, or is it expected that they remain only under /run/nvidia/driver/dev/?
• Is the GPU Operator supposed to configure the NVIDIA container runtime differently when using precompiled driver images?