Conversation
# only keep those that have been unschedulable for more than 5min over the past 30min
[30m:]) > 5
Is it "firing only when there are more than 5" or "more than 5mn" ?
5 minutes.
I replaced min with minutes in my comment to make it clearer.
[30m:]) > 5
# count per cluster
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2
I wanted to start with a value high enough that we don't get paged for nothing, but "at least 2 pods consistently failing" is probably already high enough.
I'll reduce it to "at least 2".
OK, maybe add a comment about this.
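For context, the two quoted snippets above belong to one expression. A plausible full shape is sketched below; the kube-state-metrics metric `kube_pod_status_unschedulable`, the `kube-system` namespace filter, and a 1-minute evaluation interval for the `[30m:]` subquery are assumptions, so this is not necessarily the exact rule in this PR:

```promql
count(
  # only keep pods that have been unschedulable for more than 5 of the last 30 minutes
  # (assumes the [30m:] subquery resolves at a 1m evaluation interval)
  count_over_time(
    (kube_pod_status_unschedulable{namespace="kube-system"} == 1)[30m:]
  ) > 5
# count per cluster (threshold under discussion above)
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2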
# Let's start sending these alerts to atlas, then we can switch to the provider team when properly tuned.
team: atlas
I am not so sure about this. I would rather hand this over to kaas; this is not an observability topic.
That's the plan, but only once we're happy with how the alert behaves.
Still, I think it's up to kaas to adjust the alert and get familiar with it.
All right, let them deal with this.
I've updated the code, but I'll need to warn them in Slack as well before merging.
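For reference, the handover discussed here is just a matter of changing the rule's `team` routing label. A minimal sketch of the labels block, assuming a `severity: page` label (which is not quoted in this thread):

```yaml
labels:
  # routed to atlas while the alert is being tuned; hand over later by
  # switching this label to the owning provider/kaas team
  team: atlas
  severity: page
```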
Force-pushed from e9a32ca to 314cb95
Shouldn't this already be covered by an alert about unsatisfied Deployments/DaemonSets?
The alert you're thinking of usually targets a single Deployment. This alert's goal is to detect cluster-wide instability.
Are you sure? I'm pretty sure we have a generic one. But yeah, rather be safe than sorry. 🙂
Force-pushed from 0e61e0d to f333ef0
Force-pushed from f333ef0 to 92867b7
Ping @giantswarm/team-phoenix and @giantswarm/team-rocket: this new alert would page you.
Towards: https://github.com/giantswarm/giantswarm/issues/33710
This PR adds an alert that fires when multiple pods are unschedulable in the `kube-system` namespace.
This indicates a general problem with the cluster, and in the future it should trigger an inhibition for alerts about individual broken components (like `LoggingAgentDown`).
Currently paging atlas even though this is kaas-related.
We'll probably discuss paging kaas when the alert is proven to work as intended.
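As a rough illustration of the inhibition idea mentioned above, an Alertmanager `inhibit_rules` entry could look something like the sketch below; the source alert name is hypothetical, and Giant Swarm's actual inhibition setup may use a different mechanism:

```yaml
inhibit_rules:
  # While the cluster-wide "kube-system pods unschedulable" alert is firing,
  # mute per-component alerts such as LoggingAgentDown on the same cluster.
  - source_matchers:
      - alertname="KubeSystemPodsUnschedulable"  # hypothetical name for this PR's alert
    target_matchers:
      - alertname="LoggingAgentDown"
    equal:
      - cluster_id
```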
Checklist
oncall-kaas-cloud GitHub group).