From 2c6514a79bbb5a53b51bcf636707e186e8e5939a Mon Sep 17 00:00:00 2001
From: Aistis Jokubauskas
Date: Tue, 6 Dec 2022 14:15:19 +0200
Subject: [PATCH 1/3] Add PrometheusMissingRuleEvaluations runbook

A quick runbook to mitigate `PrometheusMissingRuleEvaluations` alert.

ref.:
- https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group
- https://www.robustperception.io/rule-groups-for-hierarchical-aggregation/
---
 .../PrometheusMissingRuleEvaluations.md | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md

diff --git a/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
new file mode 100644
index 0000000..619ce41
--- /dev/null
+++ b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
@@ -0,0 +1,28 @@
+# PrometheusMissingRuleEvaluations
+
+## Meaning
+
+Alert fires when prometheus rule_group evaluation takes consistently longer than rule_group interval.
+
+## Impact
+
+Rule groups have either alerts or recording rules. If prometheus can not evaluate rules in time - it might fail to trigger alert.
+
+## Diagnosis
+
+Quick checks:
+- Check if enough resources allocated to promeheus.
+- Check if there are no bad neighbors that consume too much CPU.
+
+Deep dive:
+- Use `prometheus_rule_group_iterations_missed_total` metric to identify strugling rule_group.
+
+## Mitigation
+
+Quick fixes:
+- Increase CPU resources allocation to prometheus.
+- Movebad neighbor to different host.
+
+Deep dive:
+- Increase rule evaluate interval.
+- Splitup up rule_group into smaller groups if rules do not depend on each other. It should help because rules inside a group are evaluated in sequence.

From 3a5605c6620494418d07bce9b559d4cf69787689 Mon Sep 17 00:00:00 2001
From: Aistis Jokubauskas
Date: Fri, 10 Feb 2023 17:34:13 +0200
Subject: [PATCH 2/3] Update content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md

docs: add links to `rule_group`

Co-authored-by: Jonathan Ballet
---
 content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
index 619ce41..ed13b0d 100644
--- a/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
+++ b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
@@ -2,7 +2,7 @@
 
 ## Meaning
 
-Alert fires when prometheus rule_group evaluation takes consistently longer than rule_group interval.
+Alert fires when Prometheus [`rule_group`](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group) evaluation takes consistently longer than [`rule_group`'s `interval`](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group).
 
 ## Impact
 

From c00a82e53e3ca84327f7f614c225885c589d4fc4 Mon Sep 17 00:00:00 2001
From: Aistis Jokubauskas
Date: Fri, 10 Feb 2023 18:09:45 +0200
Subject: [PATCH 3/3] Apply suggestions from code review

docs: grammar fixes and style improvements

Co-authored-by: Jonathan Ballet
---
 .../prometheus/PrometheusMissingRuleEvaluations.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
index ed13b0d..9476934 100644
--- a/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
+++ b/content/runbooks/prometheus/PrometheusMissingRuleEvaluations.md
@@ -6,23 +6,23 @@ Alert fires when Prometheus [`rule_group`](https://prometheus.io/docs/prometheus
 
 ## Impact
 
-Rule groups have either alerts or recording rules. If prometheus can not evaluate rules in time - it might fail to trigger alert.
+Rule groups have either alerts or recording rules. If Prometheus can not evaluate rules in time, it might fail to trigger alerts.
 
 ## Diagnosis
 
 Quick checks:
-- Check if enough resources allocated to promeheus.
+- Check if enough resources allocated to Prometheus.
 - Check if there are no bad neighbors that consume too much CPU.
 
 Deep dive:
-- Use `prometheus_rule_group_iterations_missed_total` metric to identify strugling rule_group.
+- Use `prometheus_rule_group_iterations_missed_total` metric to identify the struggling rule groups.
 
 ## Mitigation
 
 Quick fixes:
-- Increase CPU resources allocation to prometheus.
-- Movebad neighbor to different host.
+- Increase CPU resources allocation to Prometheus.
+- Move bad neighbors to different hosts.
 
 Deep dive:
-- Increase rule evaluate interval.
-- Splitup up rule_group into smaller groups if rules do not depend on each other. It should help because rules inside a group are evaluated in sequence.
+- Increase the [`rule_group`'s evaluation `interval`](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group).
+- Split up [rule groups](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule_group) into smaller groups if rules do not depend on each other. It should help because rules inside a group are evaluated in sequence, whereas groups are evaluated in parallel.
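
The reasoning behind the last mitigation step (rules inside a group run in sequence, groups run in parallel) can be illustrated with a small toy model. This is only a sketch and not Prometheus code: the `RuleGroup` class, the `split` helper, the group name, and the per-rule durations are all hypothetical, and the model assumes that each group's evaluation time is simply the sum of its rules' evaluation times, ignoring CPU contention between groups.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RuleGroup:
    """Toy stand-in for a Prometheus rule group (hypothetical, for illustration)."""
    name: str
    interval_s: float              # the group's evaluation interval, in seconds
    rule_durations_s: List[float]  # made-up per-rule evaluation times, in seconds

    def evaluation_time_s(self) -> float:
        # Rules inside one group are evaluated in sequence, so durations add up.
        return sum(self.rule_durations_s)

    def misses_iterations(self) -> bool:
        # An iteration is missed when one evaluation takes longer than the interval.
        return self.evaluation_time_s() > self.interval_s


def split(group: RuleGroup, parts: int) -> List[RuleGroup]:
    """Distribute rules round-robin over `parts` smaller groups.

    Only valid when the rules do not depend on each other, because the
    resulting groups are evaluated independently of one another.
    """
    return [
        RuleGroup(f"{group.name}-{i}", group.interval_s, group.rule_durations_s[i::parts])
        for i in range(parts)
    ]


# One large group whose rules together take longer than the interval.
big = RuleGroup("node-rules", interval_s=30.0, rule_durations_s=[12.0, 9.0, 11.0, 8.0])
print(big.name, big.evaluation_time_s(), big.misses_iterations())

# After splitting, each smaller group is checked against the same interval.
for g in split(big, 2):
    print(g.name, g.evaluation_time_s(), g.misses_iterations())
```

In this model the single group needs 40 seconds per evaluation against a 30 second interval, while each of the two smaller groups fits within it. Splitting does not reduce the total work, so it is unlikely to help if Prometheus is already short on CPU; in that case the quick fixes listed in the runbook apply first.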