Add PodsUnschedulable alert #1657
Conversation
# only keep those that have been unschedulable for more than 5min over the past 30min
[30m:]) > 5
Is it "firing only when there are more than 5" or "more than 5mn" ?
5 minutes.
I replaced `min` with `minutes` in my comment to make it clearer.
[30m:]) > 5
# count per cluster
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2
Why 2?
I wanted to start with a value high enough to avoid getting paged for nothing, but "at least 2 pods consistently failing" is probably already high enough.
I'll reduce it to "at least 2".
Ok, maybe add a comment about this
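For reference, here is a minimal sketch of how the two thresholds fit together in the full expression. Only the `[30m:]) > 5` part, the `by (...)` clause, and the `> 2` threshold are visible in this diff; the metric name (`kube_pod_status_unschedulable`), the `sum_over_time` aggregation, and the namespace filter are assumptions on my side.

```yaml
# Hedged sketch; metric name, sum_over_time and the namespace filter are
# assumptions, not taken from this PR's diff.
- alert: PodsUnschedulable
  expr: |
    count(
      # keep pods that have been unschedulable for more than 5 minutes
      # over the past 30 minutes (one sample per evaluation via the subquery)
      sum_over_time(
        kube_pod_status_unschedulable{namespace="kube-system"}[30m:]
      ) > 5
    # count per cluster
    ) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
    > 2
```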
# Let's start sending these alerts to atlas, then we can switch to the provider team when properly tuned.
team: atlas
I am not so sure about this; I would rather hand this over to kaas, as this is not an observability topic.
That's the plan, but only once we're happy with how the alert behaves.
Still, I think it's up to kaas to adjust the alert and get familiar with it.
All right, let them deal with this.
I've updated the code, but I'll also need to warn them on Slack before merging.
Force-pushed from e9a32ca to 314cb95.
Shouldn't this already be covered by an alert about unsatisfied Deployments/DaemonSets?
The alert you're thinking of usually targets a single deployment. The goal of this alert is to detect cluster-wide instabilities.
Are you sure? I'm pretty sure we have a generic one. But yeah, rather be safe than sorry. 🙂
Force-pushed from 0e61e0d to f333ef0.
Force-pushed from f333ef0 to 92867b7.
Ping @giantswarm/team-phoenix and @giantswarm/team-rocket: this new alert would page you.
Towards: https://github.com/giantswarm/giantswarm/issues/33710
This PR adds an alert that fires when multiple pods are unschedulable in the `kube-system` namespace. This indicates a general problem with the cluster, and in the future it should trigger an inhibition for alerts about individual broken components (like `LoggingAgentDown`).
Currently this pages atlas even though it is kaas-related.
We'll probably discuss paging kaas once the alert is proven to work as intended.
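As a rough illustration of the inhibition idea mentioned above, an Alertmanager `inhibit_rules` entry could look like the sketch below. This is not part of this PR; the target alert name and the `cluster_id` equality label are assumptions used only to show the shape of such a rule.

```yaml
# Hedged sketch of a possible future inhibition, not implemented in this PR.
inhibit_rules:
  - source_matchers:
      - alertname = PodsUnschedulable
    target_matchers:
      - alertname = LoggingAgentDown
    # only inhibit component alerts coming from the same cluster
    equal:
      - cluster_id
```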
Checklist
`oncall-kaas-cloud` GitHub group).