[Issue]: Unable to Use Insecure Private Registry #243

Open
@CcccYxx

Problem Description

After adding the following option to .spec.driver of the DeviceConfig:

imageRegistryTLS:
    insecure: true
    insecureSkipTLSVerify: true

The image builder can push the image to the in-cluster registry, as shown in the registry log:

10.244.17.124 - - [18/Jun/2025:21:26:11 +0000] "GET /v2/amdgpu-driver/manifests/ubuntu-22.04-5.15.0-138-generic-6.4 HTTP/1.1" 200 1503 "" "go-containerregistry/v0.20.2"

However, the KMM worker cannot pull from the private registry, because the image for the worker pod's init container is requested over HTTPS. Below are the events from the KMM worker pod:

kubectl describe pod kmm-worker-xxx-default -n amd-operator
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulling  57s (x4 over 2m21s)  kubelet  Pulling image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4"
  Warning  Failed   57s (x4 over 2m21s)  kubelet  Failed to pull image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to pull and unpack image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to resolve reference "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4": failed to do request: Head "https://twuni-docker-registry.amd-operator.svc.cluster.local:5000/v2/amdgpu-driver/manifests/ubuntu-22.04-5.15.0-138-generic-6.4": dial tcp: lookup twuni-docker-registry.amd-operator.svc.cluster.local: Temporary failure in name resolution
  Warning  Failed   57s (x4 over 2m21s)  kubelet  Error: ErrImagePull
  Warning  Failed   45s (x6 over 2m20s)  kubelet  Error: ImagePullBackOff
  Normal   BackOff  31s (x7 over 2m20s)  kubelet  Back-off pulling image "twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-5.15.0-138-generic-6.4"

The KMM worker pod is stuck in the Init:ImagePullBackOff state:

amd-operator    kmm-worker-xxx-default                                  0/1     Init:ImagePullBackOff   0          14m
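
As a reading aid for the kubelet event above, here is a minimal sketch that sorts a pull error message into a rough category. The function name `classify_pull_error` and the category labels are hypothetical; the matched substrings are common Go and container runtime error texts, and the list is not exhaustive. It only reports which error text is present in the message and does not by itself show whether the TLS settings were honored.

```shell
#!/usr/bin/env bash
# Classify an image pull error message by the error text it contains.
# Hypothetical helper: the categories are illustrative labels, not operator output.
classify_pull_error() {
  case "$1" in
    # The node could not resolve the registry host name.
    *"failure in name resolution"*|*"no such host"*) echo "dns" ;;
    # The registry answered in plain HTTP while the client spoke HTTPS.
    *"server gave HTTP response to HTTPS client"*) echo "http-vs-https" ;;
    # The TLS handshake reached the registry but certificate checks failed.
    *"x509:"*) echo "tls-verify" ;;
    *) echo "other" ;;
  esac
}

# Tail of the "Failed to pull image" event quoted in this report.
msg='failed to do request: Head "https://twuni-docker-registry.amd-operator.svc.cluster.local:5000/v2/amdgpu-driver/manifests/ubuntu-22.04-5.15.0-138-generic-6.4": dial tcp: lookup twuni-docker-registry.amd-operator.svc.cluster.local: Temporary failure in name resolution'

classify_pull_error "$msg"   # prints "dns"
```

For the event in this report the request URL uses https://, and the error text that follows it is a name-resolution failure for the cluster-local service name.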

Operating System

Ubuntu 22.04.5 LTS (Jammy Jellyfish)

CPU

Intel(R) Xeon(R) Platinum 8470

GPU

AMD Instinct MI300X VF

ROCm Version

ROCm 6.4

ROCm Component

No response

Steps to Reproduce

  1. Prepare an environment similar to the one described above, without pre-installed ROCm drivers.
  2. Set up a private in-cluster registry (twuni):
helm repo add twuni https://helm.twun.io
helm install twuni twuni/docker-registry -n amd-operator --create-namespace
  3. Install the AMD GPU Operator.
  4. Create the DeviceConfig with the following .spec.driver:
driver:
  # enable operator to install out-of-tree amdgpu kernel module
  enable: true
  # blacklist is required for installing out-of-tree amdgpu kernel module
  blacklist: true
  # replace with the in-cluster registry
  image: twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver
  imageRegistryTLS:
    insecure: true
    insecureSkipTLSVerify: true
  # Specify the driver version by using ROCm version
  version: "6.4"
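
To make the last step easier to reproduce, the fragment above can be wrapped in a complete manifest. This is a minimal sketch under stated assumptions: the apiVersion, kind, name, and namespace are taken from the owner reference of the Module shown later in this report, the file name `deviceconfig.yaml` is arbitrary, and every DeviceConfig field not mentioned in this report (for example a node selector) is omitted and may be needed in a real cluster.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a DeviceConfig manifest mirroring the .spec.driver settings from this report.
# Assumption: apiVersion/kind/name/namespace come from the Module's owner reference;
# all other DeviceConfig fields are intentionally left out of this sketch.
cat > deviceconfig.yaml <<'EOF'
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: default
  namespace: amd-operator
spec:
  driver:
    # enable operator to install out-of-tree amdgpu kernel module
    enable: true
    # blacklist is required for installing out-of-tree amdgpu kernel module
    blacklist: true
    # in-cluster registry used in this report
    image: twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver
    imageRegistryTLS:
      insecure: true
      insecureSkipTLSVerify: true
    # driver version, given as the ROCm version
    version: "6.4"
EOF

# The manifest is only written, not applied; apply it manually when ready.
echo "wrote deviceconfig.yaml; apply with: kubectl apply -f deviceconfig.yaml"
```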

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

k8s Client Version: v1.30.13
k8s Server Version: v1.30.13
gpu-operator Version: v1.3.0

The KMM Module seems to be missing an additional registryTLS setting near the end of its spec.
Below is the modules.kmm.sigs.x-k8s.io resource in the cluster:

kubectl describe  modules.kmm.sigs.x-k8s.io -n amd-operator
Name:         default
Namespace:    amd-operator
Labels:       <none>
Annotations:  <none>
API Version:  kmm.sigs.x-k8s.io/v1beta1
Kind:         Module
Metadata:
  Creation Timestamp:  2025-06-18T21:22:53Z
  Finalizers:
    kmm.node.kubernetes.io/module-finalizer
  Generation:  1
  Owner References:
    API Version:           amd.com/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  DeviceConfig
    Name:                  default
    UID:                   0e7f0fdf-2aed-4311-bbe1-4b7afbd2fa96
  Resource Version:        11031464
  UID:                     06c066dc-3039-4ffa-95ff-2da97363dd8e
Spec:
  Module Loader:
    Container:
      In Tree Module To Remove:  
      Kernel Mappings:
        Build:
          Base Image Registry TLS:
          Build Args:
            Name:   DRIVERS_VERSION
            Value:  6.4
            Name:   REPO_URL
            Value:  https://repo.radeon.com
          Dockerfile Config Map:
            Name:                  ubuntu-22.04-default-amd-operator
        Container Image:           twuni-docker-registry.amd-operator.svc.cluster.local:5000/amdgpu-driver:ubuntu-22.04-${KERNEL_FULL_VERSION}-6.4
        In Tree Module To Remove:  
        Literal:                   5.15.0-138-generic
        Regexp:                    
        Registry TLS:
          Insecure:                  true
          Insecure Skip TLS Verify:  true
      Modprobe:
        Args:
        Dir Name:       /opt
        Firmware Path:  firmwareDir/updates
        Module Name:    amdgpu
        Modules Loading Order:
          amdgpu
          amdttm
          amdkcl
        Parameters:
          ip_block_mask=0x7f
      Registry TLS:
      Version:             6.4
    Service Account Name:  amd-gpu-operator-kmm-module-loader
  Selector:
    feature.node.kubernetes.io/amd-vgpu:  true
  Tolerations:
    Effect:    NoSchedule
    Key:       amd-gpu-driver-upgrade
    Operator:  Equal
    Value:     true
Status:
  Device Plugin:
  Module Loader:
    Desired Number:                  1
    Nodes Matching Selector Number:  1
Events:
  Type    Reason          Age   From  Message
  ----    ------          ----  ----  -------
  Normal  BuildCreated    13m   kmm   Build created for kernel 5.15.0-138-generic
  Normal  BuildSucceeded  10m   kmm   Build job succeeded for kernel 5.15.0-138-generic

Notice the empty Registry TLS: field near the end of the spec, after Parameters: ip_block_mask=0x7f. I am unsure whether this is related to the issue.
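
The two Registry TLS fields can be compared directly instead of reading them out of the describe output. The sketch below is a hedged example: the function name `module_registry_tls` is hypothetical, and the JSON field paths are inferred from the `kubectl describe` output above rather than confirmed against the Module CRD schema.

```shell
#!/usr/bin/env bash
# Print the registryTLS settings at both levels of a KMM Module.
# Assumption: the JSON paths below are inferred from the describe output in this report.
module_registry_tls() {
  local name="$1" namespace="$2"

  # Per-kernel-mapping setting (populated in the output above).
  echo "kernelMappings[0].registryTLS:"
  kubectl get modules.kmm.sigs.x-k8s.io "$name" -n "$namespace" \
    -o jsonpath='{.spec.moduleLoader.container.kernelMappings[0].registryTLS}{"\n"}'

  # Container-level setting (empty in the output above).
  echo "container.registryTLS:"
  kubectl get modules.kmm.sigs.x-k8s.io "$name" -n "$namespace" \
    -o jsonpath='{.spec.moduleLoader.container.registryTLS}{"\n"}'
}
```

With the names from this report, the call would be `module_registry_tls default amd-operator`.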
