Commit 8d961e1

Improve ClusterAutoscalerFailedScaling alert to reduce false positives
1 parent 96da361 commit 8d961e1

2 files changed, with 8 additions and 3 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Changed
+
+- Improved `ClusterAutoscalerFailedScaling` alert expression to reduce false positives by detecting ongoing scaling failures rather than cumulative historical failures.
+
 ## [4.64.0] - 2025-06-05
 
 ### Changed

helm/prometheus-rules/templates/kaas/tenet/alerting-rules/cluster-autoscaler.rules.yml

Lines changed: 4 additions & 3 deletions
@@ -26,13 +26,14 @@ spec:
         topic: cluster-autoscaler
     - alert: ClusterAutoscalerFailedScaling
       annotations:
-        description: '{{`Cluster-Autoscaler on {{ $labels.cluster_id }} has failed scaling up.`}}'
+        description: '{{`Cluster-Autoscaler on {{ $labels.cluster_id }} has failed scaling up {{ $value | printf "%.0f" }} times recently.`}}'
         runbook_url: https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/cluster-autoscaler-scaling/
-      expr: cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"} > 0
-      for: 15m
+      expr: increase(cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"}[15m]) > 3 and rate(cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"}[5m]) > 0
+      for: 5m
       labels:
         area: kaas
         cancel_if_outside_working_hours: "true"
+        cancel_if_cluster_has_no_workers: "true"
         severity: page
         team: tenet
         topic: cluster-autoscaler
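
To illustrate why the new expression should produce fewer false positives, here is a rough Python sketch that evaluates both conditions against two made-up counter series. It is a simplified model, not how Prometheus computes `increase()` or `rate()`: it assumes one sample per minute, no counter resets, and no extrapolation, and it ignores the `for:` clause. The function names and both series are hypothetical.

```python
# Simplified model of the old and new alert expressions.
# Samples are (minute, value) pairs of a counter scraped once per minute.
# Assumptions: no counter resets, no Prometheus-style extrapolation,
# and the `for:` duration is not modelled.

def window_increase(samples, now, window):
    """Last minus first sample value within [now - window, now]."""
    values = [v for t, v in samples if now - window <= t <= now]
    return values[-1] - values[0] if len(values) >= 2 else 0

def old_expr(samples, now):
    """Models: cluster_autoscaler_failed_scale_ups_total > 0"""
    seen = [v for t, v in samples if t <= now]
    return bool(seen) and seen[-1] > 0

def new_expr(samples, now):
    """Models: increase(...[15m]) > 3 and rate(...[5m]) > 0"""
    # A rate above zero is equivalent to a positive increase over the window.
    return (window_increase(samples, now, 15) > 3
            and window_increase(samples, now, 5) > 0)

# Hypothetical series: five failures in the first five minutes, then quiet.
historical = [(t, min(t, 5)) for t in range(0, 121)]
# Hypothetical series: one new failure every minute.
ongoing = [(t, t) for t in range(0, 121)]

for name, series in (("historical", historical), ("ongoing", ongoing)):
    print(name, "old:", old_expr(series, 60), "new:", new_expr(series, 60))
```

In this model the old condition holds for both series at minute 60, because the counter never drops back to zero after its first failure. The new condition holds only for the series that is still failing.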
