Omni : problem to use an OnPrem image-factory #711

Open
1 task done
flpajany opened this issue Oct 27, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@flpajany

flpajany commented Oct 27, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have installed Image Factory (v0.5.0) in a Docker container on the same machine that runs my on-prem Omni.
Here is the command I use to start Omni:

docker run \
-d \
--net=host \
--cap-add=NET_ADMIN \
--device /dev/net/tun \
--name omni \
--restart unless-stopped \
-v /root/omni/etcd:/_out/etcd \
-v /root/omni/tls.crt:/tls.crt \
-v /root/omni/tls.key:/tls.key \
-v /root/omni/omni.asc:/omni.asc \
-v /root/omni/descriptor.xml:/saml-descriptor \
-v /root/omni/certs:/etc/ssl/certs \
siderolabs/omni:v0.42.3 \
--account-id=$(cat /root/omni/omni-account-uuid) \
--name=onprem-omni \
--enable-break-glass-configs \
--private-key-source=file:///omni.asc \
--event-sink-port=8091 \
--cert=/tls.crt \
--key=/tls.key \
--machine-api-cert=/tls.crt \
--machine-api-key=/tls.key \
--bind-addr=0.0.0.0:443 \
--machine-api-bind-addr=0.0.0.0:8090 \
--k8s-proxy-bind-addr=0.0.0.0:8100 \
--advertised-api-url=https://omni-test.<mydomaine>/ \
--siderolink-api-advertised-url=https://omni-test.<mydomaine>:8090/ \
--siderolink-wireguard-advertised-addr=10.144.18.178:50180 \
--advertised-kubernetes-proxy-url=https://omni-test.<mydomaine>:8100/ \
--auth-saml-enabled=true \
--talos-installer-registry=<mylocal registry>:5005/siderolabs/installer \
--image-factory-pxe-address=https://factory-talos-test.<mydomaine>/ \
--image-factory-address=https://factory-talos-test.<mydomaine>/ \
--auth-saml-metadata=/saml-descriptor

But when I launch an upgrade for a machine, I find this line in the logs:

[talos] task upgrade (1/1): performing upgrade via "factory.talos.dev/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.7.7"

It is trying to download the installer from the official factory and not my own.

Asking for help.

Thanks,
Regards

Expected Behavior

I expect to see this line when my machine launches a Talos upgrade:

[talos] task upgrade (1/1): performing upgrade via "factory-talos-test.<mydomaine>/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.7.7"

Steps To Reproduce

  1. Run a local Image Factory (on-prem)
  2. Launch Omni (also on-prem) with the option --image-factory-address=

What browsers are you seeing the problem on?

Chrome

Anything else?

I got it working for standalone machines (in maintenance mode) by customizing and recompiling Omni, because factory.talos.dev is hardcoded in the TypeScript frontend:

diff --git a/frontend/src/methods/machine.ts b/frontend/src/methods/machine.ts
index b57084c..c2ffc91 100644
--- a/frontend/src/methods/machine.ts
+++ b/frontend/src/methods/machine.ts
@@ -148,8 +148,8 @@ const copyUserLabels = (src: Resource, dst: Resource) => {

 export const updateTalosMaintenance = async (machine: string, talosVersion: string, schematic?: string) => {
   const image = schematic ?
-    `factory.talos.dev/installer/${schematic}:v${talosVersion}` :
-    `ghcr.io/siderolabs/installer:v${talosVersion}`;
+    `factory-talos-test.<mydomaine>/installer/${schematic}:v${talosVersion}` :
+    `app01-lvn-re.phys.rece:5005/siderolabs/installer:v${talosVersion}`;

   await MachineService.Upgrade({image}, withRuntime(Runtime.Talos), withContext({
     nodes: [machine]

Unfortunately, this does not work for nodes inside a configured cluster: they still try to contact factory.talos.dev.

And this time, I cannot find any mention of factory.talos.dev anywhere in the code (except in tests)...
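
For illustration, the two hardcoded hosts in the diff above could instead be derived from values like the ones passed to --image-factory-address and --talos-installer-registry. The sketch below is only an assumption about how that might look: InstallerSources, toRegistryHost and installerImage are hypothetical names that do not exist in Omni, and the example hosts are placeholders.

```typescript
// Hypothetical types and helpers for illustration only; none of these names
// come from the Omni codebase.
interface InstallerSources {
  // Image Factory host (and optional port), without a URL scheme.
  factoryHost: string;
  // Repository used when the machine has no schematic.
  installerRegistry: string;
}

// Turns an address such as "https://factory.example.internal/" into a
// registry host that can prefix an image reference.
const toRegistryHost = (address: string): string =>
  address.replace(/^https?:\/\//, "").replace(/\/+$/, "");

// Builds the installer image reference the same way the frontend code in the
// diff does, but with configurable hosts instead of literals.
const installerImage = (
  sources: InstallerSources,
  talosVersion: string,
  schematic?: string,
): string =>
  schematic
    ? `${sources.factoryHost}/installer/${schematic}:v${talosVersion}`
    : `${sources.installerRegistry}:v${talosVersion}`;

// Example with placeholder hosts.
const exampleSources: InstallerSources = {
  factoryHost: toRegistryHost("https://factory.example.internal/"),
  installerRegistry: "registry.example.internal:5005/siderolabs/installer",
};

console.log(installerImage(exampleSources, "1.7.7", "abc123"));
console.log(installerImage(exampleSources, "1.7.7"));
```

Whether Omni makes these flag values available to the frontend is not established in this thread, so this only shows the shape of a possible fix.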

flpajany added the bug label Oct 27, 2024
@6547709

6547709 commented Oct 29, 2024

I have verified that upgrades work normally with Image Factory and Omni in an on-prem environment, so I think the problem is in your Omni configuration.

You need to configure a registry mirror for Omni that points "factory.talos.dev" to your own Image Factory address:
---omni run args---

      --registry-mirror=docker.io=https://registry.corp.local,gcr.io=https://registry.corp.local,ghcr.io=https://registry.corp.local,registry.k8s.io=https://registry.corp.local,factory.talos.dev=https://factory.corp.local

---Talos Update Logs(1.7.6 to 1.7.7)---

[talos] task upgrade (1/1): performing upgrade via "factory.talos.dev/installer/2375290fcd8dd9f8a5726dca2d320fa9101f81f2dbc08aa1e7cb5c85616dd723:v1.7.7"

---Image Factory Logs---

{"level":"info","ts":1730209640.9354632,"caller":"http/http.go:164","msg":"request","frontend":"http","method":"GET","path":"/v2/installer/2375290fcd8dd9f8a5726dca2d320fa9101f81f2dbc08aa1e7cb5c85616dd723/blobs/sha256:52e6dc076330c06f1e50d74105d7072d361287f04a3fc525ee1d9cf98855c3c0"}

@flpajany
Author

flpajany commented Oct 29, 2024

It is not working for me: when I launch the upgrade, the first node (a control plane node) tries to reach factory.talos.dev to get its image instead of my on-prem factory. Since it cannot access the Internet (we are behind a firewall), the upgrade fails. I think Omni passes "factory.talos.dev" to the node by default instead of my factory URL.


[talos] retrying error: failed to pull image "factory.talos.dev/installer/55e5f0fbfa0cee42023a4b8c92181dc72c8cf8fc637d748ab8885301a4ce51a8:v1.7.7": failed to resolve reference "factory.talos.dev/installer/55e5f0fbfa0cee42023a4b8c92181dc72c8cf8fc637d748ab8885301a4ce51a8:v1.7.7": failed to do request: Head "https://factory.talos.dev/v2/installer/55e5f0fbfa0cee42023a4b8c92181dc72c8cf8fc637d748ab8885301a4ce51a8/manifests/v1.7.7": dial tcp: lookup factory.talos.dev on 10.16.16.1:53: no such host

@6547709

6547709 commented Oct 30, 2024

I use it in an air-gapped environment without any problems. Let's go through how it works:

  1. Talos Linux uses containerd, which reaches the local registry through registry mirrors;
  2. Two kinds of registries are mirrored locally: 1) docker.io, ghcr.io...; 2) factory.talos.dev;
  3. When images are pulled from the local Image Factory, the registry mirror does the work: the URL you see is still the original one (for example: factory.talos.dev), but containerd pulls the image from the local Image Factory:
     pull factory.talos.dev/xxx:1.7.7 -> containerd -> containerd mirrors -> Image Factory (local);
  4. Your log shows that the domain name is not resolved correctly. Although the name in the log is "factory.talos.dev", with registry mirrors configured the request should actually go to the address you configured. So verify that your DNS correctly resolves the configured domain name (https://factory-talos-test.), or use the IP address directly;
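
The flow in step 3 can be sketched as a small function. This is a simplified model, not containerd's actual code: manifestCandidates and the example mirror host are made up for the illustration, the /v2/<repo>/manifests/<tag> path is taken from the error log earlier in this thread, and trying the upstream registry last is an assumption of the sketch.

```typescript
// Maps a registry host to the mirror endpoints configured for it.
type MirrorMap = Record<string, string[]>;

// Returns the manifest URLs that would be tried for an image reference:
// configured mirror endpoints first, then the original registry host.
const manifestCandidates = (image: string, mirrors: MirrorMap): string[] => {
  // "host/repo:tag" -> host, repo, tag. The host may contain a port, so the
  // tag separator is searched only in the part after the first slash.
  const slash = image.indexOf("/");
  const host = image.slice(0, slash);
  const rest = image.slice(slash + 1);
  const colon = rest.lastIndexOf(":");
  const repo = rest.slice(0, colon);
  const tag = rest.slice(colon + 1);

  const path = `/v2/${repo}/manifests/${tag}`;
  const endpoints = mirrors[host] || [];
  const viaMirrors = endpoints.map((e) => e.replace(/\/+$/, "") + path);
  return [...viaMirrors, `https://${host}${path}`];
};

const exampleImage = "factory.talos.dev/installer/abc123:v1.7.7";

// No mirror configured: only factory.talos.dev is contacted, which is what
// the failing log shows.
console.log(manifestCandidates(exampleImage, {}));

// Mirror configured: the image name is unchanged, but the first request goes
// to the local Image Factory.
console.log(
  manifestCandidates(exampleImage, {
    "factory.talos.dev": ["https://factory.example.internal"],
  }),
);
```

The image reference in the upgrade log stays factory.talos.dev in both cases; only the list of hosts actually contacted changes.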

@6547709

6547709 commented Oct 30, 2024

Other troubleshooting steps:

  1. Check the machine config to confirm that the mirrors are configured correctly (the http:// scheme must be included);
  2. If there is a DNS problem, try adding host entries through the machine config;
  3. Try using the IP address;

@flpajany
Author

Thank you for your answer. Unfortunately it won't work, because the problem seems to be that Omni tells my nodes to download the installer from the official factory (factory.talos.dev). Since my nodes have no registry mirror configured, they try to download it directly from the Internet, which is not reachable from their network (and DNS does not resolve factory.talos.dev, of course).

Nevertheless, I found a workaround thanks to your suggestions. I added these lines to my cluster patches:

      machine:
        registries:
          mirrors:
            factory.talos.dev:
              endpoints:
                - https://factory-talos-test.<mydomain>
              overridePath: false

And it works!
But I do not think this is a great solution: Omni should tell my cluster, and especially its Talos nodes, to download the installer from my local factory instead of the official one.

So I am keeping this issue open.

@6547709

6547709 commented Oct 30, 2024

I'm glad to see that you solved the problem.
Omni supports passing registry mirrors via a parameter (added in the Docker Compose configuration file); see the first line of code in my first reply. That way, nothing needs to be added to the machine config.

--registry-mirror=

@flpajany
Author

> I'm glad to see that you solved the problem. Omni supports passing Registry-mirrors via parameters (added in the Docker Compose configuration file), refer to the first line of code in my first reply. In this way, there is no need to add any configuration to the machine config.
>
> --registry-mirror=

I tried this and it does not work. Sorry.

@6547709

6547709 commented Oct 30, 2024

I can confirm that my --registry-mirror setting is working.
You are using version 0.42.3 and I am using version 0.43.1; that is the only difference.
Here is my config for reference:

  omni:
    image: ghcr.io/siderolabs/omni:v0.43.1
    container_name: omni
    restart: unless-stopped
    network_mode: host
    cap_add:
      - NET_ADMIN
    volumes:
      - ./etcd:/_out/etcd
      - ./certs/omni-chain.pem:/omni-chain.pem
      - ./certs/omni-key.pem:/omni-key.pem
      - ./certs/omni-ca.pem:/etc/ssl/certs/omni-ca.pem
      - ./omni.asc:/omni.asc
    devices:
      - "/dev/net/tun:/dev/net/tun"
    command: >
      --account-id=${OMNI_ACCOUNT_UUID}
      --cert=/omni-chain.pem
      --key=/omni-key.pem
      --siderolink-api-cert=/omni-chain.pem
      --siderolink-api-key=/omni-key.pem
      --private-key-source=file:///omni.asc
      --event-sink-port=8091
      --bind-addr=0.0.0.0:443
      --siderolink-api-bind-addr=0.0.0.0:8090
      --k8s-proxy-bind-addr=0.0.0.0:8100
      --advertised-api-url=https://${OMNI_IP}:443/
      --siderolink-api-advertised-url=https://${OMNI_IP}:8090/
      --siderolink-wireguard-advertised-addr=${OMNI_IP}:50180
      --advertised-kubernetes-proxy-url=https://${OMNI_IP}:8100/
      --auth-auth0-enabled=false
      --auth-saml-enabled
      --auth-saml-url=https://${OMNI_IP}:8443/realms/omni/protocol/saml/descriptor
      --talos-installer-registry=${OMNI_IP}:5000/siderolabs/installer
      --kubernetes-registry=${OMNI_IP}:5000/siderolabs/kubelet
      --image-factory-address=http://${OMNI_IP}:8080
      --registry-mirror=docker.io=http://${OMNI_IP}:5000,gcr.io=http://${OMNI_IP}:5000,ghcr.io=http://${OMNI_IP}:5000,registry.k8s.io=http://${OMNI_IP}:5000,factory.talos.dev=http://${OMNI_IP}:8080

@6547709

6547709 commented Oct 30, 2024

@flpajany I know why it doesn't work: the --registry-mirror parameter needs to be configured before the cluster is created.
If the cluster was created without registry-mirror configured, you can only push the mirrors through a machine config patch. Newly created clusters should include the mirrors automatically.
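
For a cluster created before --registry-mirror was set, the same host=endpoint pairs can be turned into a patch like the one posted earlier in this thread. A rough sketch, assuming the comma-separated format shown above; mirrorPatchFromFlag is a hypothetical helper, not Omni's own parser, and the addresses are placeholders.

```typescript
// Shape of the machine config patch used as a workaround in this thread.
interface MirrorPatch {
  machine: {
    registries: {
      mirrors: Record<string, { endpoints: string[] }>;
    };
  };
}

// Hypothetical helper: converts "host=endpoint,host=endpoint" into a patch
// object that can be serialized to YAML or JSON.
const mirrorPatchFromFlag = (flagValue: string): MirrorPatch => {
  const mirrors: Record<string, { endpoints: string[] }> = {};
  for (const entry of flagValue.split(",")) {
    const eq = entry.indexOf("=");
    if (eq < 1) {
      throw new Error(`expected host=endpoint, got "${entry}"`);
    }
    const host = entry.slice(0, eq).trim();
    const endpoint = entry.slice(eq + 1).trim();
    // The endpoint must carry its scheme, as noted in the troubleshooting steps.
    if (!/^https?:\/\//.test(endpoint)) {
      throw new Error(`endpoint for ${host} must start with http:// or https://`);
    }
    const existing = mirrors[host];
    if (existing) {
      existing.endpoints.push(endpoint);
    } else {
      mirrors[host] = { endpoints: [endpoint] };
    }
  }
  return { machine: { registries: { mirrors } } };
};

// Example with placeholder addresses.
const examplePatch = mirrorPatchFromFlag(
  "ghcr.io=http://192.0.2.10:5000,factory.talos.dev=http://192.0.2.10:8080",
);
console.log(JSON.stringify(examplePatch, null, 2));
```

The output can be pasted into a cluster or machine patch after converting it to YAML.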

@flpajany
Author

flpajany commented Oct 30, 2024

@6547709 Wow, it works with 0.43.2 when the machines are in a cluster! Thank you (I would really like to know how it works).

Unfortunately, when machines are in maintenance mode, they still cannot figure out that factory.talos.dev is in fact my factory:


[talos] retrying error: failed to pull image "factory.talos.dev/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.7.7": failed to resolve reference "factory.talos.dev/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.7.7": failed to do request: Head "https://factory.talos.dev/v2/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/manifests/v1.7.7": dial tcp: lookup factory.talos.dev on 10.16.16.1:53: no such host

But with my "hack", it is working.
