
Add PodsUnschedulable alert #1657

Merged: 6 commits merged into main on Jul 3, 2025

Conversation

hervenicol (Contributor)

Towards: https://github.com/giantswarm/giantswarm/issues/33710

This PR alerts when multiple pods are unschedulable in the kube-system namespace.
This shows a general problem with the cluster, and in the future this should trigger an inhibition for alerts about individual broken components (like LoggingAgentDown).

It currently pages atlas even though it is kaas-related.
We'll probably discuss handing paging over to kaas once the alert has proven to work as intended.
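Putting the snippets quoted in the review below together, the rule likely looks something like the following sketch. This is a reconstruction, not the merged rule: the metric name (`kube_pod_status_unschedulable`, from kube-state-metrics), the subquery resolution, and the exact expression layout are assumptions.

```yaml
# Hedged sketch reconstructed from the review snippets in this PR;
# metric name and subquery step are assumptions.
- alert: PodsUnschedulable
  expr: |
    count(
      # only keep pods that have been unschedulable for more than 5 minutes
      # over the past 30 minutes (one sample per 1m subquery step)
      count_over_time(
        (kube_pod_status_unschedulable{namespace="kube-system"} == 1)[30m:1m]
      ) > 5
    # count per cluster
    ) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
    # fire only when more than 2 pods per cluster are in this state
    > 2
  labels:
    team: atlas
```

The inner `count_over_time(...[30m:1m]) > 5` counts how many 1-minute samples saw the pod unschedulable, so `> 5` approximates "unschedulable for more than 5 minutes within the last 30"; the outer `count(...) by (...)` then requires more than 2 such pods per cluster.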


@hervenicol self-assigned this Jun 23, 2025
@hervenicol requested review from a team as code owners on June 23, 2025 at 09:57
Comment on lines 26 to 27
# only keep those that have been unschedulable for more than 5min over the past 30min
[30m:]) > 5
Member:

Is it firing only when there are more than 5 pods, or when they are unschedulable for more than 5 minutes?

Contributor (author):

5 minutes.
I replaced "min" with "minutes" in the comment to make it clearer.

[30m:]) > 5
# count per cluster
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2
Member:

Why 2?

Contributor (author):

I wanted to start with a value high enough that we don't get paged for nothing, but "at least 2 pods consistently failing" is probably high enough.
I'll reduce it to "at least 2".

Member:

OK, maybe add a comment explaining this choice.

Comment on lines 38 to 39
# Let's start sending these alerts to atlas, then we can switch to the provider team when properly tuned.
team: atlas
Member:

I'm not so sure about this. I'd rather hand it over to kaas; this is not an observability topic.

Contributor (author):

That's the plan, but only once we're happy with how the alert behaves.

Member:

Still, I think it's up to kaas to adjust the alert and get familiar with it.

Contributor (author):

All right, let them deal with it.
I've updated the code, but I'll also need to warn them on Slack before merging.

@hervenicol force-pushed the podsunschedulable branch 2 times, most recently from e9a32ca to 314cb95 on June 23, 2025 at 12:34
@Gacko (Member) commented Jun 23, 2025:

Shouldn't this already be covered by an alert about unsatisfied Deployments/DaemonSets?

@QuentinBisson (Contributor):
The alert you're thinking of usually targets a single Deployment. The goal of this alert is to detect cluster-wide instability.

@Gacko (Member) commented Jun 25, 2025:

Are you sure? I'm pretty sure we have a generic one. But yeah, rather be safe than sorry. 🙂

@hervenicol force-pushed the podsunschedulable branch 2 times, most recently from 0e61e0d to f333ef0 on June 26, 2025 at 14:43
@hervenicol (Contributor, author):
Ping @giantswarm/team-phoenix and @giantswarm/team-rocket

This new alert would page you.
I linked to the "most fitting" opsrecipe I could find, but it probably deserves some improvements.
Are you all good with it?

@hervenicol merged commit 285805e into main on Jul 3, 2025
6 checks passed
@hervenicol deleted the podsunschedulable branch on July 3, 2025 at 15:13