CML Kubernetes self-hosted runner is registered to GitHub but the workflow never continues #1415

Open
ludelafo opened this issue Aug 16, 2023 · 7 comments

Comments

@ludelafo
Contributor

Hi CML team,

I'm facing an issue with CML when creating a self-hosted runner for GitHub on a Google Cloud Kubernetes cluster.

The runner is created and seems to register to GitHub. However, the workflow never continues and hangs on

"***"level":"info","message":"iterative_cml_runner.runner: Still creating... [hhmsss elapsed]"***"

I'm using the following steps to create the runner:

  1. Create a personal access token (PAT) with repo scope.
  2. Store the PAT in a GitHub repository secret named CML_PAT.
  3. Create a Google Service Account Key to allow access to the Kubernetes cluster.
  4. Store the Service Account Key in a GitHub repository secret named GCP_SERVICE_ACCOUNT_KEY.
  5. Create a GitHub Workflow file with the following content:
    name: Workflow from actions
    
    on:
      push:
    
    jobs:
      setup-runner:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout repository
            uses: actions/checkout@v3
          ## Google Cloud
          - name: Login to Google Cloud
            uses: 'google-github-actions/auth@v1'
            with:
              credentials_json: '${{ secrets.GCP_SERVICE_ACCOUNT_KEY }}'
          - name: Get Google Cloud's Kubernetes credentials
            uses: 'google-github-actions/get-gke-credentials@v1'
            with:
              cluster_name: 'mlops-workshop'
              location: 'europe-west6-a'
          ## CML
          - name: Setup Node
            uses: actions/setup-node@v3
            with:
              node-version: '16'
          - name: Setup CML
            uses: iterative/setup-cml@v1
          - name: Initialize runner on Kubernetes
            env:
              REPO_TOKEN: ${{ secrets.CML_PAT }}
            run: |
              export KUBERNETES_CONFIGURATION=$(cat $KUBECONFIG)
              # https://cml.dev/doc/ref/runner
              # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type
              # https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#{cpu}-{memory}
              cml runner launch \
                --labels="cml-runner-from-actions" \
                --cloud="kubernetes" \
                --cloud-type="s"
    
      use-runner:
        needs: setup-runner
        runs-on: [self-hosted, cml-runner-from-actions]
        steps:
          - name: Checkout repository
            uses: actions/checkout@v3
          # Node is required to run CML
          - name: Setup Node
            uses: actions/setup-node@v3
            with:
              node-version: '16'
          - name: Setup CML
            uses: iterative/setup-cml@v1
          - name: Create CML report
            env:
              REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
            run: |
              echo "It's a success!" >> report.md
              cml comment update --publish report.md
  6. Create a commit to trigger the workflow.

Here are some logs that might help you:

Logs of the runner just after the start
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  27.8M      0  0:00:03  0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
Logs of the runner after some time
$ kubectl logs cml-hothxdswe6-5u5j99rk-34bk17gn-gdmc8
Failed to get unit file state for cml.service: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 84.5M  100 84.5M    0     0  27.8M      0  0:00:03  0:00:03 --:--:-- 41.1M
bash: line 23: lsof: command not found
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 290ms"}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 249ms"}
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.5.5"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 309ms"}
{"date":"2023-08-16T14:14:13.820Z","level":"info","message":"runner status","repo":"https://github.com/csia-pme/cml-with-tpi-from-sources","status":"ready"}
{"level":"info","message":"Unregistering runner cml-hothxdswe6-5u5j99rk-34bk17gn..."}
{"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 301ms"}
{"level":"info","message":"DELETE /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/5 - 204 in 410ms"}
{"level":"info","message":"\tSuccess"}
{"level":"info","message":"Waiting 10 seconds to destroy"}
Logs of the GitHub workflow
***"level":"info","message":"POST /repos/csia-pme/cml-with-tpi-from-sources/actions/runners/registration-token - 201 in 133ms"***
***"level":"info","message":"GET /repos/csia-pme/cml-with-tpi-from-sources/actions/runners?per_page=100 - 200 in 130ms"***
***"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."***
***"level":"warn","message":"ignoring RUNNER_NAME environment variable, use CML_RUNNER_NAME or --name instead"***
***"level":"info","message":"Preparing workdir /home/runner/.cml/hothxdswe6..."***
***"level":"info","message":"Deploying cloud runner plan..."***
***"level":"info","message":"Terraform apply..."***
***"level":"info","message":"Terraform 1.5.4"***
***"level":"info","message":"iterative_cml_runner.runner: Plan to create"***
***"level":"info","message":"Plan: 1 to add, 0 to change, 0 to destroy."***
***"level":"info","message":"iterative_cml_runner.runner: Creating..."***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [1m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [2m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [3m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [4m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [5m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [6m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [7m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [8m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [9m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [10m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [11m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [12m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [13m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [14m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [15m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [16m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [17m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [18m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m10s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m20s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m30s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m40s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [19m50s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Still creating... [20m0s elapsed]"***
***"level":"info","message":"iterative_cml_runner.runner: Creation errored after 20m3s"***

I was able to confirm that the runner registers with GitHub by running the following command (from the GitHub API documentation):

curl -L \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer MY_CML_PAT" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
Output of the cURL command
$ curl -L \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer MY_CML_PAT" \
    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/csia-pme/cml-with-tpi-from-sources/actions/runners
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {
          "id": 1,
          "name": "self-hosted",
          "type": "read-only"
        },
        {
          "id": 2,
          "name": "Linux",
          "type": "read-only"
        },
        {
          "id": 3,
          "name": "X64",
          "type": "read-only"
        },
        {
          "id": 5,
          "name": "cml-runner-from-actions",
          "type": "custom"
        }
      ]
    }
  ]
}
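
The response above can also be checked programmatically. The following is a minimal sketch, assuming the JSON body returned by the runners endpoint is available as a string; the function name `find_ready_runners` and the trimmed `SAMPLE` payload are illustrative, not part of CML or the GitHub API. It lists the runners that are online, idle, and carry the label that the `use-runner` job asks for.

```python
import json


def find_ready_runners(payload: str, label: str) -> list:
    """Return the names of runners that are online, not busy, and carry `label`."""
    data = json.loads(payload)
    ready = []
    for runner in data.get("runners", []):
        # Each label is an object; only its "name" matters for `runs-on` matching.
        label_names = {entry["name"] for entry in runner.get("labels", [])}
        is_online = runner.get("status") == "online"
        is_idle = not runner.get("busy", False)
        if is_online and is_idle and label in label_names:
            ready.append(runner["name"])
    return ready


# Trimmed copy of the response shown above (illustrative sample).
SAMPLE = """
{
  "total_count": 1,
  "runners": [
    {
      "id": 5,
      "name": "cml-hothxdswe6-5u5j99rk-34bk17gn",
      "os": "Linux",
      "status": "online",
      "busy": false,
      "labels": [
        {"id": 1, "name": "self-hosted", "type": "read-only"},
        {"id": 5, "name": "cml-runner-from-actions", "type": "custom"}
      ]
    }
  ]
}
"""

print(find_ready_runners(SAMPLE, "cml-runner-from-actions"))
```

A non-empty result means GitHub considers the runner available for jobs with that label, which matches the observation that registration itself succeeds while the `setup-runner` job is still waiting.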

You can find a repository with the code used to reproduce this issue here.

I created two workflows to test the runner:

You can find the execution of the two workflows here and here.

I tried all sorts of things to make it work, but I was not able to find a solution. I tried to:

  • Use a different runner (I tried with a runner with different specs and on a different GitHub repository)
  • Set a PAT with all the scopes
  • Set a PAT with only the repo scope
  • Add permissions to the GitHub workflow file
  • Use older versions of CML (0.18.x)
  • Use the hidden --cloud-image="iterativeai/cml:0-dvc3-base1-gpu", --tpi-version="= 0.11.18" and --cml-version="0.19.0" arguments to set older versions of CML and TPI
  • Build from sources

Please let me know if I can be of any help and thank you!

@dacbd
Contributor

dacbd commented Aug 21, 2023

@ludelafo I think the issue is that the cml runner launch command executed in the setup-runner job attempts to connect over SSH for a readiness check. If the runner is not routable and the connection fails, then after 20m it tries to clean up all the created resources.

The two solutions I see are:

  • Allow a public IP to be assigned so that the resources created by CML can be accessed.
  • Run a small (permanent) GitHub Actions service on the cluster to act as the "job starter", which should be able to route to the CML-created pod on the internal network.
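
Whether the machine running cml runner launch can route to the pod at all can be tested independently of CML. The following is a minimal sketch, assuming the readiness check uses the default SSH port 22 (an assumption, not confirmed in this thread); the helper name `can_reach` is hypothetical, and the host would be whatever address the pod is expected to be reachable on.

```python
import socket


def can_reach(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unroutable hosts and DNS failures.
        return False


# Example usage (POD_ADDRESS is a placeholder for the pod's address):
#   print(can_reach("POD_ADDRESS", 22))
```

If this returns False from the GitHub-hosted job but True from a machine inside the cluster network, that would be consistent with the routing explanation above.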

@ludelafo
Contributor Author

Hi @dacbd, thank you for your input.

I'll investigate more on my side to check if I can fix the issue.

What puzzles me is that I remember having the same setup previously, and it worked out of the box.

I'll get back to you if I find something.

@ludelafo
Contributor Author

Hello @dacbd,

After a few months working on other projects, I'm back on CML/MLOps principles.

After updating all packages to check whether this issue was resolved, my team and I are still having trouble using CML with Kubernetes and GitHub Actions.

To try to identify the problem, I created a minimal reproducible example that you can find here: https://github.com/swiss-ai-center/cml-kubernetes-github-actions-runner-minimal-reproducible-example. It contains all the steps to reproduce the issue and open questions for further investigation.

We are three people looking into this issue and weren't able to find a solution. I'll tag them (@rmarquis, @leonardcser) so they can intervene in the conversation if necessary.

We are highly motivated to help Iterative fix this issue, so please let us know how we can help!

Thanks in advance,
Ludovic

@dacbd
Contributor

dacbd commented Feb 29, 2024

@ludelafo I'm sorry, I don't have much capacity to help you, and I'm not sure how busy @0x2b3bfa0 is.

A few things I would recommend:

Inspect the cluster to make sure the pod is even being created.

Also go into your GCP Logs Explorer and inspect the API calls/activity to make sure nothing is being denied or missing.

CML generates an SSH key that is used for the instance. You can run the command locally using your own SSH key (there should be a few examples in the docs), then try to SSH into it yourself and inspect the contents for errors. (CML does its readiness check via SSH.)
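
The first check can be scripted. The following is a minimal sketch, assuming kubectl is installed and already configured for the cluster, and that the pods live in the `default` namespace (both assumptions). The `cml-` prefix is taken from the pod name in the logs earlier in this thread; the helper names are hypothetical.

```python
import json
import subprocess


def summarize_cml_pods(pods_json: str, prefix: str = "cml-") -> dict:
    """Map pod name -> phase for pods whose name starts with `prefix`.

    `pods_json` is the output of `kubectl get pods --output json`.
    """
    items = json.loads(pods_json).get("items", [])
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in items
        if item["metadata"]["name"].startswith(prefix)
    }


def fetch_pods_json(namespace: str = "default") -> str:
    """Run kubectl and return its JSON output; needs kubectl and cluster credentials."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example usage against a live cluster:
#   print(summarize_cml_pods(fetch_pods_json()))
```

An empty result while the workflow is still printing "Still creating..." would suggest the pod is never scheduled; a pod stuck in Pending would point at scheduling or resource limits rather than at SSH routing.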

@neemias-carvalho-movti

Hi @dacbd and @ludelafo,

I’m encountering a similar issue with CML while setting up a self-hosted runner for GitHub on an Azure Kubernetes cluster. Despite the runner being created and registering successfully with GitHub, the workflow hangs with the log message:

{"level":"info","message":"iterative_cml_runner.runner: Still creating... [20m0s elapsed]"}

Interestingly, I have an operational setup using AKS 1.23 with the same code, but I encounter the error when trying to execute the pipeline via GitHub Actions in an AKS 1.27 or 1.28 environment.

I see different pod logs when running on AKS 1.23 than when running on AKS 1.27 or 1.28:

AKS 1.23
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.3.7"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/xxxxxx/xxxxxxxxxxx/actions/runners/registration-token - 201 in 464ms"}
{"date":"2024-07-27T02:57:27.576Z","level":"info","message":"runner status","repo":"https://github.com/xxxxxxxx/xxxxxxxx","status":"ready"}

AKS 1.27 or 1.28
{"level":"warn","message":"Github Actions timeout has been updated from 72h to 35 days. Update your workflow accordingly to be able to restart it automatically."}
{"level":"info","message":"Preparing workdir /home/runner..."}
{"level":"info","message":"Launching github runner"}
{"level":"info","message":"Terraform 1.8.4"}
{"level":"info","message":"Plan: 0 to add, 0 to change, 0 to destroy."}
{"level":"info","message":"Apply complete! Resources: 0 added, 0 changed, 0 destroyed."}
{"level":"info","message":"Outputs: 0"}
::warning::cloud credentials are no longer available on self-hosted runner steps; please use step.env and secrets instead
{"level":"warn","message":"Error connecting to ACPI socket: connect ENOENT /var/run/acpid.socket. The acpid.service helps with instance termination detection."}
{"level":"info","message":"POST /repos/xxxxxxxx/xxxxxxxxx/actions/runners/registration-token - 201 in 481ms"}
{"date":"2024-07-27T02:10:39.655Z","level":"info","message":"runner status","repo":"https://github.com/xxxxxx/xxxxxxx","status":"ready"}
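
Two pod logs like the ones above can be compared mechanically. The following is a minimal sketch, assuming each log is available as a multi-line string; the helper names `normalize` and `only_in_second` are hypothetical. Digits are masked so that timings and timestamps do not count as differences, which also hides version numbers such as the Terraform version, so those are best compared separately. Redacted repository names of different lengths will still show up as differences.

```python
import json
import re


def normalize(line: str) -> str:
    """Reduce a log line to its message text with every run of digits masked as N."""
    line = line.strip()
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    # JSON log lines carry the text in "message"; other lines are used as-is.
    text = str(record.get("message", line)) if isinstance(record, dict) else line
    return re.sub(r"\d+", "N", text)


def only_in_second(first_log: str, second_log: str) -> list:
    """Return lines of `second_log` whose normalized form never occurs in `first_log`."""
    seen = {normalize(line) for line in first_log.splitlines() if line.strip()}
    return [
        line
        for line in second_log.splitlines()
        if line.strip() and normalize(line) not in seen
    ]


# Example usage, with the two logs stored in strings:
#   for line in only_in_second(aks_123_log, aks_127_log):
#       print(line)
```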

Any guidance or assistance you can provide would be greatly appreciated. We are keen to resolve this and are open to any further investigation or adjustments needed.

Thanks in advance,

Neemias

@ludelafo
Contributor Author

Hi @neemias-carvalho-movti,

Thank you very much for your insights. I'm glad I'm not the only person having these issues.

My suspicions were the same as yours. I do think it's a difference in compatibility between versions of Kubernetes that I haven't been able to test on my own: Google Cloud automatically updates Kubernetes clusters and I wasn't able to go back to an old enough version to validate this point.

Fortunately, thanks to your feedback, I think Iterative now has a new lead.

I do not work on this project anymore, but my colleagues might be able to help you if needed.

Looking forward to seeing any improvements on this!
Ludovic

@rmarquis

@neemias-carvalho-movti We didn't find a solution to this issue on our side. Good to know that it is seemingly the Kubernetes version that might be responsible here.

For our use case, we eventually side-stepped the problem by using two runners: a standard GitHub action that instantiates the k8s cluster, and a second one that trains the model with GPU support and generates CML reports. We don't use CML to instantiate k8s.

rmarquis added a commit to swiss-ai-center/a-guide-to-mlops that referenced this issue Aug 14, 2024
We will not use CML for retraining on Kubernetes as this is
seemingly not working anymore on newer Kubernetes releases.

See iterative/cml#1415