Releases: giantswarm/prometheus-rules
Releases · giantswarm/prometheus-rules
v4.61.0
Changed
- Add
grafana-postgresql
in theObservabilityStorageSpaceTooLow
alert's monitored PVCs.
Added
- Add
GrafanaPostgresqlReplicationFailure
andGrafanaPostgresqlArchivingFailure
alerting rules ingrafana.rules.yml
. - Vintage cleanup:
- Removed code behind obvious vintage/capi conditions in Cabbage rules.
- Removed code behind obvious vintage/capi conditions in Honeybadger rules.
- Removed code behind obvious vintage/capi conditions in Tenet rules.
- Removed code behind obvious vintage/capi conditions in Shield rules.
- Remove the "aws" provider.
- Clean up mimir-heartbeat type that was needed when we have both old and new heartbeats.
- Add
CAPATooManyReconciliations
to page when CAPA controllers are stuck reconciling over and over.
v4.60.0
Changed
Added
-
Add
OnPremCloudProviderAPIIsDown
alert to all clusters -
Vintage cleanup:
- Stopped running tests for vintage. Meaning some vintage-specific labels had to be removed.
- Removed code behind obvious vintage/capi conditions in Atlas rules.
- Removed code behind obvious vintage/capi conditions in Phoenix rules.
v4.59.2
Changed
- Improved
AlloyUnhealthyComponents
alert by adding pod name
v4.59.1
Changed
LogForwardingErrors
: don't page out of business hours
v4.59.0
Added
- Add new alert
KonfigureOperatorDeploymentNotSatisfied
: whenkonfigure-operator
deployment ingiantswarm
namespace is not ready for 30 mins. - Add new alert
KonfigurationReconciliationFailed
: when aManagementClusterConfiguration
CR in notReady
condition for 10 mins.
v4.58.0
Changed
- DeploymentNotSatisfiedAtlas: lower sensitivity and page only during business hours
Added
- Add new alert
ClusterUpgradeStuck
to detect if the cluster app cannot be upgraded.
v4.57.0
Changed
- PromtailRequestsErrors does not fire anymore when alloy-logs is running
Added
- Added PromtailConflictsWithAlloy alert
v4.56.1
Changed
- Reenabled storage alerts LogVolumeSpaceTooLow and RootVolumeSpaceTooLow as paging during working hours until we have node problem detector deployed.
Fixed
- Fix SLOs recording rules sent to Grafana Cloud that sometimes trigger PrometheusRulesFailure due to the origin metric pod changing.
v4.56.0
Changed
- Improved
ClusterAutoscalerFailedScaling
alert to detect stuck states where cluster-autoscaler fails to scale.
v4.55.0
Changed
- Improve ClusterCrossplaneResourcesNotReady with new metrics where available
- Improve alert for Karpenter machines not being Ready
Fixed
- Use
exported_namespace
for certificate expiration alerts.
Removed
- Remove alerts related to
alloy-rules
.