Releases: giantswarm/prometheus-rules
v4.70.0
v4.69.0
Added
- Add GrafanaPostgresqlRecoveryTestFailed alerting rule.
Changed
- PrometheusOperatorRejectedResources: only page for MC resources.
Removed
- DuplicatePrometheusOperatorKubeletService was for clusters before v20, which we don't have anymore.
v4.68.0
Changed
- Update CoreDNS alerts to page only for resources in "kube-system" namespace.
- Route FluxKustomizationFailed for silences kustomization to Atlas.
v4.67.0
Changed
- FluentbitDropRatio only pages for management cluster instances (giantswarm-managed).
Removed
- Removed FluentbitTooManyErrors alerts, as this is already covered by FluentbitDropRatio alerts and they mostly page together.
v4.66.0
Added
- Added cancel_if_metrics_broken inhibition to the following alerts:
  - ManagementClusterDeploymentMissingCAPA
  - ManagementClusterDeploymentMissingCAPI
  - ETCDBackupMetricsMissing
  - PrometheusMissingGrafanaCloud
  - MimirToGrafanaCloudExporterDown
  - ManagementClusterDexAppMissing
- Add CiliumAgentPodPending alert for Cabbage.
Changed
- LogForwardingErrors description improvement.
v4.65.1
Changed
- Increase MimirIngesterNeedsToBeScaledUp alert's time to trigger from 6h to 12h to avoid noise coming from temporary spikes.
- WorkloadClusterWebhookDurationExceedsTimeoutSolutionEngineers alert: make it page only during business hours, and increase delay to 1h before it pages.
- MetricForwardingErrors alert: make it less sensitive
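
The effect of a longer time to trigger, as in the MimirIngesterNeedsToBeScaledUp change above, can be sketched roughly as follows. This is an illustrative Python model, not the rule itself: it assumes the time to trigger behaves like the for clause of a Prometheus alerting rule (the condition must hold continuously before the alert fires). The function name fires and the hourly sample data are hypothetical.

```python
from typing import Sequence


def fires(condition_by_hour: Sequence[bool], hold_hours: int) -> bool:
    """Return True if the condition held for `hold_hours` consecutive hourly samples.

    Hypothetical helper: a simplified stand-in for a continuous hold duration.
    """
    streak = 0
    for active in condition_by_hour:
        # Any inactive sample resets the streak, mirroring a continuous hold requirement.
        streak = streak + 1 if active else 0
        if streak >= hold_hours:
            return True
    return False


# Made-up data: an 8-hour spike surrounded by quiet periods.
spike = [False] * 4 + [True] * 8 + [False] * 4
# Made-up data: a condition that stays active for 14 hours.
sustained = [True] * 14

print(fires(spike, 6))       # the spike outlasts a 6h hold
print(fires(spike, 12))      # but not a 12h hold
print(fires(sustained, 12))  # a sustained problem still triggers
```

Under these assumptions, a longer hold filters out temporary spikes while still catching sustained conditions, at the cost of a later notification.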
v4.65.0
Changed
- Improved ClusterAutoscalerFailedScaling alert expression to reduce false positives by detecting ongoing scaling failures rather than cumulative historical failures.
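
The difference between cumulative and ongoing failures can be illustrated with a small sketch. It is written in Python rather than as the actual rule expression, which is not shown in these notes. The function names, the window size, and the counter samples are all hypothetical, and counter resets are ignored for simplicity.

```python
from typing import Sequence


def cumulative_failures_alert(samples: Sequence[float]) -> bool:
    """Fire if the failure counter has ever recorded a failure (hypothetical)."""
    return samples[-1] > 0


def ongoing_failures_alert(samples: Sequence[float], window: int) -> bool:
    """Fire only if the counter grew within the last `window` intervals (hypothetical)."""
    recent = samples[-(window + 1):]
    return recent[-1] - recent[0] > 0


# Made-up counter samples: one failure long ago, nothing since.
old_failure = [0, 1, 1, 1, 1, 1]
# Made-up counter samples: failures keep accumulating.
still_failing = [0, 1, 2, 3, 4, 5]

print(cumulative_failures_alert(old_failure))         # fires on stale history
print(ongoing_failures_alert(old_failure, window=3))  # stays quiet
print(ongoing_failures_alert(still_failing, window=3))  # fires on current failures
```

In this simplified model, checking counter growth over a recent window avoids alerting on failures that happened once and have since stopped.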
v4.64.0
Changed
- Removed grafana from DeploymentNotSatisfiedAtlas because it's already monitored via GrafanaDown alert.
- Rework Rocket's ManagementClusterContainerIsRestartingTooFrequently to use pod names as the selector.
- Update alert for Cilium HelmRelease to match timeout.
v4.63.0
Added
- Add IncorrectResourceUsageData alert.
Changed
- Made MimirIngesterNeedsToBeScaledUp alert less sensitive to CPU usage.
- Increase MimirIngesterNeedsToBeScaledUp alert's time to trigger from 1h to 6h to avoid noise coming from temporary spikes, such as those from stable-testing installations (giantswarm/giantswarm#33513).
- Rewrite Flux alerting rules to use the gotk_resource_info metric emitted by Kube State Metrics.
- Drop customer-related alerting rules of Flux.
- Rules unit tests: support for $provider template so we can move provider-specific tests to global tests.
- Rules unit tests: simplify file organization by removing the capi folder. Also fixes a bug in cloud-director tests.
- Rules linting: run against all configured providers.
- Exclude more containers from Rocket's ManagementClusterContainerIsRestartingTooFrequently alert.
v4.62.0
Added
- Add AppAdmissionControllerWebhookDurationExceedsTimeout alert, business hours only.
Removed
- Remove app-admission-controller from generic ManagementClusterWebhookDurationExceedsTimeout alert.
Changed
- Remove duplicate test files for Atlas since all tests are the same across all CAPI providers.
- Remove duplicate test files for Honeybadger since all tests are the same across all CAPI providers.
- Remove duplicate test files for Shield since all tests are the same across all CAPI providers.
- Remove duplicate test files for Tenet since all tests are the same across all CAPI providers.