Conversation
# only keep those that have been unschedulable for more than 5min over the past 30min
[30m:]) > 5
Is it "firing only when there are more than 5" or "more than 5mn" ?
5 minutes.
I replaced min with minutes in my comment to make it clearer.
[30m:]) > 5
# count per cluster
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2
I wanted to start with a value high enough that we don't get paged for nothing, but "at least 2 pods consistently failing" is probably already high enough.
I'll reduce it to "at least 2".
OK, maybe add a comment about this.
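For context, the two quoted snippets above belong to one expression. A plausible full shape is sketched below; the kube-state-metrics metric `kube_pod_status_unschedulable`, the `kube-system` namespace filter, and a 1-minute evaluation interval for the `[30m:]` subquery are assumptions, so this is not necessarily the exact rule in this PR:

```promql
count(
  # only keep pods that have been unschedulable for more than 5 of the last 30 minutes
  # (assumes the [30m:] subquery resolves at a 1m evaluation interval)
  count_over_time(
    (kube_pod_status_unschedulable{namespace="kube-system"} == 1)[30m:]
  ) > 5
# count per cluster (threshold under discussion above)
) by (cluster_id, cluster_type, customer, installation, pipeline, provider, region)
> 2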
# Let's start sending these alerts to atlas, then we can switch to the provider team when properly tuned.
team: atlas
I am not so sure about this. I would rather hand this over to kaas; this is not an observability topic.
That's the plan, but only once we're happy with how the alert behaves.
Still, I think it's up to kaas to adjust the alert and get familiar with it.
All right, let them deal with this.
I've updated the code, but I'll need to warn them in Slack as well before merging.
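For reference, the handover discussed here is just a matter of changing the rule's `team` routing label. A minimal sketch of the labels block, assuming a `severity: page` label (which is not quoted in this thread):

```yaml
labels:
  # routed to atlas while the alert is being tuned; hand over later by
  # switching this label to the owning provider/kaas team
  team: atlas
  severity: page
```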
Force-pushed from e9a32ca to 314cb95
Shouldn't this already be covered by an alert about unsatisfied Deployments/DaemonSets?
The alert you're thinking of usually targets a single Deployment. This alert's goal is to detect cluster-wide instability.
Are you sure? I'm pretty sure we have a generic one. But yeah, rather be safe than sorry. 🙂
Force-pushed from 0e61e0d to f333ef0
Force-pushed from f333ef0 to 92867b7
Ping @giantswarm/team-phoenix and @giantswarm/team-rocket: this new alert would page you.
Towards: https://github.com/giantswarm/giantswarm/issues/33710
This PR adds an alert that fires when multiple pods are unschedulable in the `kube-system` namespace.
This indicates a general problem with the cluster, and in the future it should trigger an inhibition for alerts about individual broken components (like `LoggingAgentDown`).
Currently paging atlas even though this is kaas-related.
We'll probably discuss paging kaas when the alert is proven to work as intended.
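As a rough illustration of the inhibition idea mentioned above, an Alertmanager `inhibit_rules` entry could look something like the sketch below; the source alert name is hypothetical, and Giant Swarm's actual inhibition setup may use a different mechanism:

```yaml
inhibit_rules:
  # While the cluster-wide "kube-system pods unschedulable" alert is firing,
  # mute per-component alerts such as LoggingAgentDown on the same cluster.
  - source_matchers:
      - alertname="KubeSystemPodsUnschedulable"  # hypothetical name for this PR's alert
    target_matchers:
      - alertname="LoggingAgentDown"
    equal:
      - cluster_id
```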
Checklist
oncall-kaas-cloud GitHub group).