Description
Problem Description
After adding the following options to the DeviceConfig .spec.driver:
imageRegistryTLS:
  insecure: true
  insecureSkipTLSVerify: true
The image builder can push the image to the in-cluster registry, as shown in the registry log:
10.244.17.124 - - [18/Jun/2025:21:26:11 +0000] "GET /v2/amdgpu-driver/manifests/ubuntu-22.04-5.15.0-138-generic-6.4 HTTP/1.1" 200 1503 "" "go-containerregistry/v0.20.2"
However, the KMM worker cannot pull from the private registry: the kubelet tries to pull the image for the worker pod's init container from the registry over HTTPS. Below are the events from the KMM worker pod:
kubectl describe pod kmm-worker-xxx-default -n amd-operator
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulling 57s (x4 over 2m21s) kubelet Pulling image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4"
Warning Failed 57s (x4 over 2m21s) kubelet Failed to pull image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to pull and unpack image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to resolve reference "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to do request: Head "https://twuni-docker-registry.amd-operator.svc.cluster.local:5000/v2/amdgpu-driver/manifests/ubuntu-22.04-5.15.0-138-generic-6.4": dial tcp: lookup twuni-docker-registry.amd-operator.svc.cluster.local: Temporary failure in name resolution
Warning Failed 57s (x4 over 2m21s) kubelet Error: ErrImagePull
Warning Failed 45s (x6 over 2m20s) kubelet Error: ImagePullBackOff
Normal BackOff 31s (x7 over 2m20s) kubelet Back-off pulling image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4"
The KMM worker pod stays in the Init:ImagePullBackOff state:
amd-operator kmm-worker-xxx-default 0/1 Init:ImagePullBackOff 0 14m
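When reading the events above, it may help to separate name-resolution failures from TLS or HTTPS failures, since the kubelet message ends with "Temporary failure in name resolution". The following is a minimal sketch of such a check; `classify_pull_error` is a hypothetical helper, and the matched substrings are assumptions about typical kubelet and container runtime messages, not an exhaustive list.

```shell
# Hypothetical helper: classify a kubelet image-pull error message.
# The substrings matched here are assumptions, not an exhaustive list.
classify_pull_error() {
  case "$1" in
    *"failure in name resolution"*|*"no such host"*)
      # The node could not resolve the registry host name.
      echo "dns" ;;
    *"x509:"*|*"server gave HTTP response to HTTPS client"*)
      # The registry was reached, but the TLS/HTTPS negotiation failed.
      echo "tls" ;;
    *)
      echo "other" ;;
  esac
}

# Example with the tail of the event message from this report.
classify_pull_error 'dial tcp: lookup twuni-docker-registry.amd-operator.svc.cluster.local: Temporary failure in name resolution'
# prints: dns
```

A "dns" result would suggest also checking whether the node itself (not only pods) can resolve the cluster-local service name.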
Operating System
Ubuntu 22.04.5 LTS (Jammy Jellyfish)
CPU
Intel(R) Xeon(R) Platinum 8470
GPU
AMD Instinct MI300X VF
ROCm Version
ROCm 6.4
ROCm Component
No response
Steps to Reproduce
- Prepare an environment similar to the one described above, without pre-installed ROCm drivers.
- Set up a private in-cluster registry (twuni):
helm repo add twuni https://helm.twun.io
helm install twuni twuni/docker-registry -n amd-operator --create-namespace
- Install AMD GPU Operator
- Create the device config with .spec.driver:
driver:
  # enable operator to install out-of-tree amdgpu kernel module
  enable: true
  # blacklist is required for installing out-of-tree amdgpu kernel module
  blacklist: true
  # replace with in-cluster registry
  image: twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver
  imageRegistryTLS:
    insecure: true
    insecureSkipTLSVerify: true
  # Specify the driver version by using ROCm version
  version: "6.4"
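For reference, the driver settings above can be wrapped into a complete manifest. This is only a sketch: `render_deviceconfig` is a hypothetical helper, the apiVersion, kind, name and namespace are taken from the owner reference shown in the Module output in this report, and any other fields a real DeviceConfig may need (for example a node selector) are omitted.

```shell
# Hypothetical helper: print a minimal DeviceConfig manifest to stdout.
# Only the fields quoted in this report are included; a real config may
# need more (for example a node selector).
render_deviceconfig() {
  cat <<'EOF'
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: default
  namespace: amd-operator
spec:
  driver:
    enable: true
    blacklist: true
    image: twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver
    imageRegistryTLS:
      insecure: true
      insecureSkipTLSVerify: true
    version: "6.4"
EOF
}

# Usage (requires cluster access):
#   render_deviceconfig | kubectl apply -f -
```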
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
k8s Client Version: v1.30.13
k8s Server Version: v1.30.13
gpu-operator Version: v1.3.0
The KMM Module seems to be missing an additional registryTLS setting near the end. Below is the modules.kmm.sigs.x-k8s.io resource in the cluster:
kubectl describe modules.kmm.sigs.x-k8s.io -n amd-operator
Name: default
Namespace: amd-operator
Labels: <none>
Annotations: <none>
API Version: kmm.sigs.x-k8s.io/v1beta1
Kind: Module
Metadata:
  Creation Timestamp: 2025-06-18T21:22:53Z
  Finalizers:
    kmm.node.kubernetes.io/module-finalizer
  Generation: 1
  Owner References:
    API Version: amd.com/v1alpha1
    Block Owner Deletion: true
    Controller: true
    Kind: DeviceConfig
    Name: default
    UID: 0e7f0fdf-2aed-4311-bbe1-4b7afbd2fa96
  Resource Version: 11031464
  UID: 06c066dc-3039-4ffa-95ff-2da97363dd8e
Spec:
  Module Loader:
    Container:
      In Tree Module To Remove:
      Kernel Mappings:
        Build:
          Base Image Registry TLS:
          Build Args:
            Name: DRIVERS_VERSION
            Value: 6.4
            Name: REPO_URL
            Value: https://repo.radeon.com
          Dockerfile Config Map:
            Name: ubuntu-22.04-default-amd-operator
        Container Image: twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-${KERNEL_FULL_VERSION}-6.4
        In Tree Module To Remove:
        Literal: 5.15.0-138-generic
        Regexp:
        Registry TLS:
          Insecure: true
          Insecure Skip TLS Verify: true
      Modprobe:
        Args:
        Dir Name: /opt
        Firmware Path: firmwareDir/updates
        Module Name: amdgpu
        Modules Loading Order:
          amdgpu
          amdttm
          amdkcl
        Parameters:
          ip_block_mask=0x7f
      Registry TLS:
      Version: 6.4
    Service Account Name: amd-gpu-operator-kmm-module-loader
  Selector:
    feature.node.kubernetes.io/amd-vgpu: true
  Tolerations:
    Effect: NoSchedule
    Key: amd-gpu-driver-upgrade
    Operator: Equal
    Value: true
Status:
  Device Plugin:
  Module Loader:
    Desired Number: 1
    Nodes Matching Selector Number: 1
Events:
  Type    Reason          Age  From  Message
  ----    ------          ---  ----  -------
  Normal  BuildCreated    13m  kmm   Build created for kernel 5.15.0-138-generic
  Normal  BuildSucceeded  10m  kmm   Build job succeeded for kernel 5.15.0-138-generic
Notice the empty Registry TLS: near the end, after Parameters: ip_block_mask=0x7f. I am unsure whether this is related to this issue.
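To compare the two Registry TLS entries directly, something like the following could be used. It is a sketch only: `show_registry_tls` is a hypothetical helper, and the JSON field paths are inferred from the describe output in this report rather than confirmed against the KMM API reference.

```shell
# Hypothetical helper: print registryTLS at the container level and at the
# first kernel mapping of a KMM Module. Field paths are inferred from the
# describe output above and are an assumption.
show_registry_tls() {
  ns="$1"
  name="$2"
  for path in \
    '.spec.moduleLoader.container.registryTLS' \
    '.spec.moduleLoader.container.kernelMappings[0].registryTLS'
  do
    printf '%s: ' "$path"
    kubectl get modules.kmm.sigs.x-k8s.io "$name" -n "$ns" -o jsonpath="{$path}"
    printf '\n'
  done
}

# Usage (requires cluster access):
#   show_registry_tls amd-operator default
```

An empty value on the first line and a populated one on the second would match what the describe output shows.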