Describe the bug
We have found that chaining multiple Kustomizations, each waiting on other Kustomizations that in turn wait on their resources to be reconciled, substantially slows down the controllers. There are no obvious CPU or memory bottlenecks when this behavior is observed.
Steps to reproduce
Following the recommended structure for a monorepo, our team started to develop a pattern where we'd do something like
apps
  foo
    init
      configs.yaml
      kustomization.yaml
    meta
      meta.yaml
    service
      deployment.yaml
      kustomization.yaml
An example meta.yaml file here would look like
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: foo-init
  namespace: flux-system
spec:
  targetNamespace: foo
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/foo/init
  interval: 5m
  timeout: 2m
  prune: true
  wait: true
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: foo-service
  namespace: flux-system
spec:
  targetNamespace: foo
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: foo-init
  path: ./apps/foo/service
  interval: 5m
  timeout: 2m
  prune: true
  wait: true
Then, at a higher level, we'd maintain an apps.yaml file closer to our flux-system component that would look like
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: foo-app
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/foo/meta
  interval: 5m
  timeout: 2m
  prune: true
  wait: true
We found that having this wait: true in the top-level apps.yaml spec was substantially locking up our reconciliation cycles. Removing these checks substantially improved performance while changing nothing about the downstream resources being reconciled. The Flux Prometheus metrics show this dramatic change in behavior. On 10/13/2025 at 12:00 PM PDT, the manifests were updated to remove these extra wait settings. It takes some time for the system to fully pick up the changes, but reconciliation loops then drop back to milliseconds rather than being pegged at the 5-minute timeouts from before.
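For reference, a sketch of what foo-app looks like after the change, assuming only the top-level wait setting was dropped (spec.wait defaults to false) and the leaf Kustomizations foo-init and foo-service kept their own wait: true:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: foo-app
  namespace: flux-system
spec:
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps/foo/meta
  interval: 5m
  timeout: 2m
  prune: true
  # wait: true removed here; foo-init and foo-service still set wait: true themselves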
Expected behavior
In a healthy cluster, Kustomizations should be able to wait on each other transitively as much as they want without degrading controller performance.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v2.7.0
Flux check
► checking prerequisites
✗ flux 2.7.0 <2.7.2 (new CLI version is available, please upgrade)
✔ Kubernetes 1.33.5-gke.1080000 >=1.32.0-0
► checking version in cluster
✔ distribution: flux-v2.7.0
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v1.4.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.7.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.7.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.7.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1
✔ externalartifacts.source.toolkit.fluxcd.io/v1
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1
✔ helmreleases.helm.toolkit.fluxcd.io/v2
✔ helmrepositories.source.toolkit.fluxcd.io/v1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed
Git provider
Github Enterprise
Container Registry provider
Self-hosted Mirror
Additional context
No response
Code of Conduct
- I agree to follow this project's Code of Conduct