
buildkitd rootless on GKE: runc run failed (unable to join session keyring: disk quota exceeded) #6247

@alex-devops-sh


Description of bug

I’m running buildkitd in rootless mode on a GKE cluster. Most of the time everything works fine, but occasionally (after ~1–2 weeks of uptime), builds start failing with the following error:

runc run failed: unable to start container process: error during container init: 
unable to join session keyring: unable to create session key: disk quota exceeded

Restarting the buildkitd container resolves the problem temporarily, but it eventually reappears.
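
For what it's worth, the "disk quota exceeded" here appears to be EDQUOT from the kernel key quota rather than anything filesystem-related: each runc run joins or creates a session keyring, and those keys are charged against the per-UID key quota (kernel.keys.maxkeys, 200 keys by default for non-root users). A rough way to check this on the node while the error is occurring, assuming UID 1000 is the rootless buildkitd user:

# qnkeys/maxkeys shows how many keys UID 1000 currently holds vs. its quota;
# once qnkeys reaches maxkeys, key creation fails with EDQUOT ("disk quota exceeded").
cat /proc/key-users

# Kernel-wide per-user key limits (defaults for non-root users: 200 keys / 20000 bytes).
sysctl kernel.keys.maxkeys kernel.keys.maxbytes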

Deployment setup:
I deploy buildkitd as a service in Kubernetes with the following configuration:

containers:
  - name: buildkitd
    image: moby/buildkit:master-rootless
    args:
      - --addr
      - unix:///run/user/1000/buildkit/buildkitd.sock
      - --addr
      - tcp://0.0.0.0:1234
      - --oci-worker-no-process-sandbox
      - --oci-worker-gc
      - --oci-worker-rootless
      - --oci-worker-gc-keepstorage=2000,15000,70000
      - --oci-max-parallelism=10
    env:
      - name: "BUILDKIT_SESSION_KEYRING"
        value: "0"
    securityContext:
      seccompProfile:
        type: Unconfined
      appArmorProfile:
        type: Unconfined
      runAsUser: 1000
      runAsGroup: 1000
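
If the per-user key quota is indeed what runs out, one node-level mitigation I'm considering (a sketch only, not something I have verified on GKE) would be to raise the key limits on the node itself, since kernel keyrings are a node-wide resource that is not namespaced per pod:

# Run on the GKE node, e.g. from a privileged DaemonSet; the values are examples only.
sysctl -w kernel.keys.maxkeys=20000
sysctl -w kernel.keys.maxbytes=400000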

The GitLab runner (also in Kubernetes) runs build jobs using this container image:
moby-buildkit:v0.23.2-v0.20.3-28.3.2
Job definition:

containerize:
  image: moby-buildkit:v0.23.2-v0.20.3-28.3.2
  variables:
    BUILDKITD_FLAGS: --oci-worker-no-process-sandbox
    BUILDKIT_HOST: tcp://buildkitd.gitlab-runner.svc.cluster.local:1234
    BUILD_ARGS: ""
    CACHE_ARGS: "--export-cache type=registry,ref=${CI_REGISTRY_IMAGE}/${IMAGE_NAME}:cache-${CI_COMMIT_REF_SLUG},mode=max,image-manifest=true --import-cache type=registry,ref=${CI_REGISTRY_IMAGE}/${IMAGE_NAME}:cache-${CI_COMMIT_REF_SLUG}"
  script: ...

Setting BUILDKIT_SESSION_KEYRING=0 didn’t prevent the issue.
Restarting the container temporarily fixes it, which suggests a resource leak.
Node metrics show that disk usage is not full and CPU/memory quotas are not exceeded when the error occurs.

Could this be related to session keyring exhaustion in rootless mode?
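
To tell a leak apart from plain build concurrency, something like the following could be run periodically on the node hosting buildkitd; if the count keeps climbing toward the quota even while no builds are running, keys from exited containers are apparently not being reclaimed:

# Print the in-use/quota key count for UID 1000 (the rootless buildkitd user).
# Fields in /proc/key-users: uid, usage, nkeys/nikeys, qnkeys/maxkeys, qnbytes/maxbytes.
awk '$1 == "1000:" { print $4 }' /proc/key-users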
