Bug Description
Hi, we are using Terraform + Flux to deploy our clusters to EKS. We have separate Flux Git repos for infrastructure components, apps, and our cluster fleet.
When deploying a cluster, we deploy infra components in logical groups (storage, networking, monitoring, etc.), using Flux's dependency handling to make sure the CRDs, controllers and configs are applied in the correct order.
As part of our networking component group, we install aws-load-balancer-controller, along with a few other components (Envoy Gateway, cert-manager, external-dns), currently all via Helm charts.
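For context, the ordering between component groups is handled with Flux's `dependsOn`; a minimal sketch of what that looks like for the networking group (the names and paths here are illustrative, not our exact layout):

```yaml
# Minimal sketch of the group-level ordering (illustrative names/paths).
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: networking
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/networking   # contains the HelmReleases mentioned above
  prune: true
  sourceRef:
    kind: GitRepository
    name: infrastructure
  dependsOn:
    - name: storage                   # the networking group waits for the storage group
```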
The problem is that on the initial cluster deploy, flux HelmReleases for these other components will fail with the error:
* Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.networking.svc:443/mutate-v1-service?timeout=10s": no endpoints available for service "aws-load-balancer-webhook-service"
A Flux HelmRelease does not retry installs by default, so this leaves those releases in a permanently failed state, requiring us to manually delete them and have Flux re-reconcile them, which then succeeds, and everything on the cluster is happy.
If we increase the number of retries on these HelmReleases, the initial deploy will succeed (after a retry).
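Concretely, "increasing the number of retries" means adding install remediation to the affected HelmReleases, roughly like this (release and source names are illustrative):

```yaml
# Illustrative: retrying the initial install on one of the affected releases.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: external-dns
  namespace: networking
spec:
  interval: 10m
  chart:
    spec:
      chart: external-dns
      sourceRef:
        kind: HelmRepository
        name: external-dns            # illustrative source name
  install:
    remediation:
      retries: 3                      # retry if the first install hits the webhook error
```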
I don't know much about mutating webhooks, but it looks like aws-load-balancer-controller registers its mutating webhook before the aws-load-balancer-webhook-service Service has ready endpoints, so these initial admission requests fail.
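To make the suspected race concrete, this is roughly the shape of the webhook registration involved, reconstructed from the error message above; the configuration name, failure policy and rules are assumptions on my part, not verified against the chart:

```yaml
# Rough reconstruction from the error message; fields not present in the error
# (configuration name, failurePolicy, rules) are assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook          # assumed chart default
webhooks:
  - name: mservice.elbv2.k8s.aws
    failurePolicy: Fail                     # a Fail policy would explain the hard error
    clientConfig:
      service:
        name: aws-load-balancer-webhook-service
        namespace: networking
        path: /mutate-v1-service
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["services"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```

If a registration like this exists before the webhook Service has ready endpoints, matching Service operations fail in the way shown above.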
Steps to Reproduce
- provision a new EKS cluster
- deploy a few components using Flux HelmRelease, including aws-load-balancer-controller (a minimal sketch follows below; if required, I can try to provide a full minimal project to reproduce this).
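A minimal sketch of the controller install itself, assuming the upstream eks-charts Helm repository and the values listed under Environment below:

```yaml
# Minimal sketch of the controller install (assumes the upstream eks-charts repo).
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: eks-charts
  namespace: networking
spec:
  interval: 1h
  url: https://aws.github.io/eks-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: aws-load-balancer-controller
  namespace: networking
spec:
  interval: 10m
  chart:
    spec:
      chart: aws-load-balancer-controller
      version: 1.12.0
      sourceRef:
        kind: HelmRepository
        name: eks-charts
  values:
    clusterName: my-cluster
    controllerConfig:
      featureGates:
        ServiceTypeLoadBalancerOnly: true
    serviceAccount:
      create: false
      name: aws-load-balancer-controller
```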
Expected Behavior
Other HelmReleases do not error out with the (presumably timing-related) webhook error during the initial cluster deployment.
Current Workarounds
I could add retries to all other HelmReleases, or make aws-load-balancer-controller a separate component and have all other infra components depend on it, but that seems like a bad approach when those components don't actually depend on it being fully deployed.
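For reference, the second workaround would mean something like this on every other networking HelmRelease (release and source names are illustrative):

```yaml
# Illustrative: forcing an ordering that isn't a real dependency.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: networking
spec:
  # every other networking release would have to declare this, even though
  # none of them actually needs the controller to be up
  dependsOn:
    - name: aws-load-balancer-controller
      namespace: networking
  interval: 10m
  chart:
    spec:
      chart: cert-manager
      sourceRef:
        kind: HelmRepository
        name: jetstack                # illustrative source name
```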
Environment
- AWS Load Balancer controller version: v2.12.0
- Kubernetes version: 1.31
- Using EKS (yes/no), if so version?: yes, 1.31
- Using Service or Ingress:
- AWS region: eu-west-1
- How was the aws-load-balancer-controller installed: via a Flux HelmRelease (Helm chart)
helm ls:
aws-load-balancer-controller networking 1 2025-04-10 11:28:42.887441528 +0000 UTC deployed aws-load-balancer-controller-1.12.0 v2.12.0
- helm values:
USER-SUPPLIED VALUES:
clusterName: my-cluster
controllerConfig:
  featureGates:
    ServiceTypeLoadBalancerOnly: true
serviceAccount:
  create: false
  name: aws-load-balancer-controller
- Current state of the Controller configuration:
kubectl -n <controllernamespace> describe deployment aws-load-balancer-controller
Name:                   aws-load-balancer-controller
Namespace:              networking
CreationTimestamp:      Thu, 10 Apr 2025 13:28:46 +0200
Labels:                 app.kubernetes.io/instance=aws-load-balancer-controller
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/name=aws-load-balancer-controller
                        app.kubernetes.io/version=v2.12.0
                        helm.sh/chart=aws-load-balancer-controller-1.12.0
                        helm.toolkit.fluxcd.io/name=aws-load-balancer-controller
                        helm.toolkit.fluxcd.io/namespace=networking
Annotations:            deployment.kubernetes.io/revision: 1
                        meta.helm.sh/release-name: aws-load-balancer-controller
                        meta.helm.sh/release-namespace: networking
Selector:               app.kubernetes.io/instance=aws-load-balancer-controller,app.kubernetes.io/name=aws-load-balancer-controller
Replicas:               2 desired | 2 updated | 2 total | 2 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app.kubernetes.io/instance=aws-load-balancer-controller
                    app.kubernetes.io/name=aws-load-balancer-controller
  Annotations:      prometheus.io/port: 8080
                    prometheus.io/scrape: true
  Service Account:  aws-load-balancer-controller
  Containers:
   aws-load-balancer-controller:
    Image:       public.ecr.aws/eks/aws-load-balancer-controller:v2.12.0
    Ports:       9443/TCP, 8080/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      --cluster-name=my-cluster
      --ingress-class=alb
      --feature-gates=ServiceTypeLoadBalancerOnly=true
    Liveness:     http-get http://:61779/healthz delay=30s timeout=10s period=10s #success=1 #failure=2
    Readiness:    http-get http://:61779/readyz delay=10s timeout=10s period=10s #success=1 #failure=2
    Environment:  <none>
    Mounts:
      /tmp/k8s-webhook-server/serving-certs from cert (ro)
  Volumes:
   cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-load-balancer-tls
    Optional:    false
  Priority Class Name:  system-cluster-critical
  Node-Selectors:       <none>
  Tolerations:          <none>
Conditions:
  Type         Status  Reason
  ----         ------  ------
  Available    True    MinimumReplicasAvailable
  Progressing  True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   aws-load-balancer-controller-c95cbff64 (2/2 replicas created)
Events:
  Type    Reason             Age  From                   Message
  ----    ------             ---- ----                   -------
  Normal  ScalingReplicaSet  29m  deployment-controller  Scaled up replica set aws-load-balancer-controller-c95cbff64 to 2