Improve ClusterAutoscalerFailedScaling alert to reduce false positives #1646
runbook_url: https://intranet.giantswarm.io/docs/support-and-ops/ops-recipes/cluster-autoscaler-scaling/
expr: cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"} > 0
for: 15m
expr: increase(cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"}[15m]) > 3 and rate(cluster_autoscaler_failed_scale_ups_total{provider=~"capa|capz|eks"}[5m]) > 0
I don't think these values are right. We had a legit issue with cicdprod on the 27th that this alert should have caught, but this doesn't catch it.
OK, thanks. I'll investigate.
Also, I think we can drop that provider label. Not sure why we'd need that. We only have autoscaler on CAPA anyway.
Yeah, but I think you'll then need to drop the for and just have it trigger as soon as it goes over the value.
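To illustrate the point about dropping the for, here is a rough Python sketch, not Prometheus itself. It assumes one sample and one rule evaluation per minute, treats increase() as a plain difference between samples (no extrapolation or counter resets), and uses a made-up burst of five failed scale-ups. The function names and the sample data are hypothetical.

```python
def condition_series(counter, window=15, rate_window=5, threshold=3):
    """Per-minute truth values of a simplified version of the new expr:
    increase over `window` minutes > threshold AND some increase in the
    last `rate_window` minutes (standing in for rate(...) > 0)."""
    series = []
    for t, value in enumerate(counter):
        increase = value - counter[max(t - window, 0)]
        recent = value - counter[max(t - rate_window, 0)]
        series.append(increase > threshold and recent > 0)
    return series


def first_firing(condition, for_minutes):
    """Minute at which the alert would fire, or None if it never does.
    The condition must have been true for `for_minutes` without a gap."""
    streak = 0
    for t, ok in enumerate(condition):
        streak = streak + 1 if ok else 0
        if streak > for_minutes:
            return t
    return None


# Hypothetical data: one failed scale-up per minute during minutes 10 to 14,
# then nothing for the rest of the hour.
counter = [sum(1 for m in range(10, 15) if m <= t) for t in range(60)]
condition = condition_series(counter)

print(first_firing(condition, 0))   # 13
print(first_firing(condition, 15))  # None
```

In this toy model the combined condition is only true for about six minutes, because the short-window check goes back to zero soon after the failures stop. With for: 15m the alert never leaves pending, while without for it fires as soon as the increase goes over the threshold.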
Yeah, removed for, dropped the provider and adjusted the increase.
Towards: https://github.com/giantswarm/giantswarm/issues/33440
This PR improves the ClusterAutoscalerFailedScaling alert to reduce false positives by combining two expressions: