Tenant Kubernetes control-planes are dependent on user webhooks and api services #855

@kvaps

This is more of a question than a bug report, although it looks like an upstream issue. You might have run into it already, because the problem is especially painful when you’re running managed control-planes.

Steps to reproduce

  1. Create a new cluster that uses the Kamaji control-plane.
  2. Install cert-manager and metrics-server in that cluster.
  3. Delete all worker nodes in the cluster.

After step 3, the kube-apiserver containers begin to restart continuously. Their last state looks like this:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Fri, 27 Jun 2025 03:45:19 +0200
      Finished:     Fri, 27 Jun 2025 03:45:56 +0200

kube-apiserver logs:

I0627 01:51:13.519903       1 storage_scheduling.go:111] all system priority classes are created successfully or already exist.
W0627 01:51:13.598365       1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:51:13.598470       1 controller.go:102] "Unhandled Error" err=<
        loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
        , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
 > logger="UnhandledError"
I0627 01:51:13.639692       1 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0627 01:51:13.770139       1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:51:13.778185       1 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions for APIService" logger="UnhandledError"
I0627 01:51:13.780788       1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0627 01:51:36.743669       1 controller.go:615] quota admission added evaluator for: roles.rbac.authorization.k8s.io
I0627 01:51:37.045507       1 controller.go:615] quota admission added evaluator for: rolebindings.rbac.authorization.k8s.io
I0627 01:52:02.718675       1 controller.go:615] quota admission added evaluator for: serviceaccounts
I0627 01:52:08.733639       1 controller.go:615] quota admission added evaluator for: daemonsets.apps
I0627 01:52:08.748588       1 controller.go:615] quota admission added evaluator for: deployments.apps
W0627 01:52:13.640042       1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:52:13.640116       1 controller.go:102] "Unhandled Error" err=<
        loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
        , Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
 > logger="UnhandledError"
I0627 01:52:13.641696       1 controller.go:109] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
W0627 01:52:13.787279       1 handler_proxy.go:99] no RequestInfo found in the context
E0627 01:52:13.787336       1 controller.go:113] "Unhandled Error" err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions for APIService" logger="UnhandledError"
I0627 01:52:13.788502       1 controller.go:126] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
I0627 01:52:18.045205       1 controller.go:615] quota admission added evaluator for: ciliumendpoints.cilium.io
E0627 01:52:38.795461       1 writers.go:123] "Unhandled Error" err="apiserver was unable to write a JSON response: http: Handler timeout" logger="UnhandledError"
W0627 01:52:38.795464       1 dispatcher.go:217] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cozy-cert-manager.svc:443/validate?timeout=30s": context canceled
E0627 01:52:38.795993       1 finisher.go:175] "Unhandled Error" err="FinishRequest: post-timeout activity - time-elapsed: 579.927µs, panicked: false, err: context canceled, panic-reason: <nil>" logger="UnhandledError"
E0627 01:52:38.796591       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &errors.errorString{s:\"http: Handler timeout\"}: http: Handler timeout" logger="UnhandledError"
E0627 01:52:38.798136       1 writers.go:136] "Unhandled Error" err="apiserver was unable to write a fallback JSON response: http: Handler timeout" logger="UnhandledError"
E0627 01:52:38.799501       1 timeout.go:140] "Post-timeout activity" logger="UnhandledError" timeElapsed="4.545858ms" method="PATCH" path="/apis/cert-manager.io/v1/namespaces/cozy-ingress-nginx/certificates/ingress-nginx-root-cert" result=null
W0627 01:52:38.810270       1 dispatcher.go:217] Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cozy-cert-manager.svc:443/validate?timeout=30s": dial tcp 10.95.75.223:443: connect: operation not permitted
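
To see at a glance which dependencies the apiserver is complaining about, the excerpt above can be summarized with a short script. This is only a minimal sketch: the two regular expressions are written against the message shapes in this particular excerpt, and `summarize` is a hypothetical helper name, not part of any Kubernetes tooling.

```python
import re
from collections import Counter

# Patterns for the two failure shapes seen in the excerpt above. The optional
# backslash handles both the plain and the escaped (\"...\") quoting.
APISERVICE_RE = re.compile(r'loading OpenAPI spec for \\?"([^"\\]+)\\?" failed')
WEBHOOK_RE = re.compile(r"Failed calling webhook, failing closed (\S+):")


def summarize(lines):
    """Count failures per aggregated APIService and per fail-closed webhook."""
    apiservices, webhooks = Counter(), Counter()
    for line in lines:
        match = APISERVICE_RE.search(line)
        if match:
            apiservices[match.group(1)] += 1
        match = WEBHOOK_RE.search(line)
        if match:
            webhooks[match.group(1)] += 1
    return apiservices, webhooks


# Three lines shortened from the excerpt above (message text unchanged).
sample = [
    'loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to download v1beta1.metrics.k8s.io',
    r'err="loading OpenAPI spec for \"v1beta1.metrics.k8s.io\" failed with: Error, could not get list of group versions for APIService"',
    'Failed calling webhook, failing closed webhook.cert-manager.io: failed calling webhook "webhook.cert-manager.io"',
]

apiservices, webhooks = summarize(sample)
print(dict(apiservices))
print(dict(webhooks))
```

On the full log this points at the same two dependencies: the `v1beta1.metrics.k8s.io` APIService and the `webhook.cert-manager.io` admission webhook.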

This puts us in a deadlock situation again:

  • The main API server depends on aggregated API servers (here metrics-server) and admission webhooks (here cert-manager).
  • Those API servers and webhooks can’t start without worker nodes.
  • The worker nodes can’t be provisioned because the control-plane keeps restarting.
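
The webhook half of this loop comes from webhooks that fail closed (the "failing closed" in the log) and are served by pods inside the tenant cluster. A rough way to list them, assuming the output of `kubectl get validatingwebhookconfigurations -o json` has been loaded into a dict, could look like the sketch below. The function name and the sample data are hypothetical: only the webhook name, the Service name and namespace, and the 30s timeout are taken from the log, and the configuration names and the second entry are made up for illustration.

```python
def fail_closed_service_webhooks(config_list):
    """List webhooks that fail closed and are backed by an in-cluster Service.

    Returns tuples of (configuration, webhook, "namespace/service", timeoutSeconds).
    """
    found = []
    for cfg in config_list.get("items", []):
        for hook in cfg.get("webhooks", []):
            service = hook.get("clientConfig", {}).get("service")
            # In admissionregistration.k8s.io/v1 an unset failurePolicy means Fail.
            policy = hook.get("failurePolicy", "Fail")
            if service and policy == "Fail":
                found.append((
                    cfg["metadata"]["name"],
                    hook["name"],
                    f'{service["namespace"]}/{service["name"]}',
                    hook.get("timeoutSeconds"),
                ))
    return found


# Hypothetical sample shaped like a ValidatingWebhookConfigurationList.
sample = {
    "items": [
        {
            "metadata": {"name": "cert-manager-webhook"},  # assumed name
            "webhooks": [{
                "name": "webhook.cert-manager.io",
                "failurePolicy": "Fail",
                "timeoutSeconds": 30,
                "clientConfig": {"service": {
                    "namespace": "cozy-cert-manager",
                    "name": "cert-manager-webhook",
                }},
            }],
        },
        {
            "metadata": {"name": "example-optional"},  # made-up entry
            "webhooks": [{
                "name": "optional.example.com",
                "failurePolicy": "Ignore",
                "clientConfig": {"service": {"namespace": "default", "name": "example"}},
            }],
        },
    ]
}

for row in fail_closed_service_webhooks(sample):
    print(row)
```

Only the first entry is reported. The second one uses `failurePolicy: Ignore`, so requests it covers are admitted even when its backend is unreachable.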

Eventually, everything does recover, but until the workers come up the Kubernetes API keeps flapping because kube-apiserver is repeatedly restarted.

Questions

  • Have you encountered this behaviour before?
  • Is there a generic way to avoid or mitigate it?
  • Is there a hidden flag or configuration option for kube-apiserver that could help?
