"No endpoints available for service 'aws-load-balancer-webhook-service'" helm install error on initial deployment #4140

@bozho

Description

Bug Description
Hi, we are using Terraform + Flux to deploy our cluster to EKS. We have separate Flux git repos for infrastructure components, apps, and our cluster fleet.

When deploying a cluster, we deploy infra components in logical groups (storage, networking, monitoring, etc.), using Flux's dependency handling to make sure CRDs, controllers, and configs are deployed in the correct order.
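For context, the dependency handling between groups looks roughly like this (a trimmed sketch, not our exact layout; the group names, path, and source name below are illustrative):

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: networking
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/networking    # illustrative path
  prune: true
  sourceRef:
    kind: GitRepository
    name: infrastructure               # illustrative source name
  dependsOn:
    - name: storage                    # the networking group waits for the storage group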

As part of our networking component group, we install aws-load-balancer-controller, as well as a few other components (Envoy Gateway, cert-manager, external-dns), currently all via Helm charts.

The problem is that on the initial cluster deploy, the Flux HelmReleases for these other components fail with the error:

* Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.networking.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"

Flux HelmRelease does not retry installs by default, so this leaves these releases in a permanent failed state, requiring us to manually delete them and have Flux re-reconcile them, which then succeeds and everything on the cluster is happy.

If we increase the number of retries on these HelmReleases, the initial deploy will succeed (after a retry).
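For reference, the retry workaround on one of the affected releases looks roughly like this (a sketch; the chart and source names are placeholders, and older Flux versions use the v2beta2 API instead of v2):

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: networking
spec:
  interval: 10m
  chart:
    spec:
      chart: cert-manager              # placeholder chart name
      sourceRef:
        kind: HelmRepository
        name: jetstack                 # placeholder source name
  install:
    remediation:
      retries: 3                       # without this, a failed first install is never retried
  upgrade:
    remediation:
      retries: 3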

I don't know much about mutating webhooks, but it looks like aws-load-balancer-controller registers its mutating webhook before the aws-load-balancer-webhook-service Service is ready to handle requests, so these initial webhook calls fail.
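A couple of commands that show this after the fact (assuming the chart's default MutatingWebhookConfiguration name of aws-load-balancer-webhook; adjust if yours differs):

# does the webhook Service have any ready endpoints yet?
kubectl -n networking get endpoints aws-load-balancer-webhook-service

# which webhooks are registered, and what is their failurePolicy?
kubectl get mutatingwebhookconfiguration aws-load-balancer-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'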

Steps to Reproduce

  • provision a new EKS cluster
  • deploy a few components using Flux HelmReleases, including aws-load-balancer-controller (if required, I can try to provide a minimal project to reproduce this)

Expected Behavior
Other HelmReleases should not error out with the (presumably timing-related) webhook error during the initial cluster deployment.

Current Workarounds
I could add retries to all other HelmReleases, or make aws-load-balancer-controller a separate component and make all other infra components depend on it, but that seems like a bad approach when those components don't actually depend on it being fully deployed.
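For completeness, the dependency-based workaround would look something like this on each affected release (again just a sketch; chart and source names are placeholders):

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: external-dns                   # one of the releases hitting the webhook error
  namespace: networking
spec:
  interval: 10m
  chart:
    spec:
      chart: external-dns              # placeholder chart name
      sourceRef:
        kind: HelmRepository
        name: external-dns             # placeholder source name
  dependsOn:
    - name: aws-load-balancer-controller   # blocks reconciliation until the controller release is ready
      namespace: networking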

Environment

  • AWS Load Balancer controller version: v2.12.0
  • Kubernetes version: 1.31
  • Using EKS (yes/no), if so version?: 1.31
  • Using Service or Ingress:
  • AWS region: eu-west-1
  • How was the aws-load-balancer-controller installed:
    • helm ls:
NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                   APP VERSION
aws-load-balancer-controller    networking      1               2025-04-10 11:28:42.887441528 +0000 UTC deployed        aws-load-balancer-controller-1.12.0     v2.12.0
    • helm values:
USER-SUPPLIED VALUES:
clusterName: my-cluster
controllerConfig:
  featureGates:
    ServiceTypeLoadBalancerOnly: true
serviceAccount:
  create: false
  name: aws-load-balancer-controller
  • Current state of the Controller configuration:
    • kubectl -n <controllernamespace> describe deployment aws-load-balancer-controller
Name:                   aws-load-balancer-controller
Namespace:              networking
CreationTimestamp:      Thu, 10 Apr 2025 13:28:46 +0200
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.12.0
                        helm.sh/chart=aws-load-balancer-controller-1.12.0
                        helm.toolkit.fluxcd.io/name=aws-load-balancer-controller
                        helm.toolkit.fluxcd.io/namespace=networking
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: networking
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.12.0
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=my-cluster
      --ingress-class=alb
      --feature-gates=ServiceTypeLoadBalancerOnly=true
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:               Secret (a volume populated by a Secret)
    SecretName:         aws-load-balancer-tls
    Optional:           false
  Priority Class Name:  system-cluster-critical
  Node-Selectors:       <none>
  Tolerations:          <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-c95cbff64 (2/2 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  29m   deployment-controller  Scaled up replica set aws-load-balancer-controller-c95cbff64 to 2

Labels: kind/bug, lifecycle/stale, priority/awaiting-more-evidence
