Releases: giantswarm/prometheus-rules

v4.70.0

04 Jul 06:56
d9fa3ba

Added

  • Added PodsUnschedulable alert

Changed

  • LokiObjectStorageLowRate: don't page when WCs are being created

v4.69.0

03 Jul 14:53
f32d2db

Added

  • Add GrafanaPostgresqlRecoveryTestFailed alerting rule.

Changed

  • PrometheusOperatorRejectedResources: only page for MC resources

Removed

  • Removed DuplicatePrometheusOperatorKubeletService, which only applied to clusters older than v20; we no longer have any.

v4.68.0

02 Jul 10:01
039c70d

Changed

  • Update CoreDNS alerts to page only for resources in "kube-system" namespace.
  • Route FluxKustomizationFailed for the silences kustomization to Atlas.

v4.67.0

27 Jun 08:10
8292f79

Changed

  • FluentbitDropRatio only pages for management cluster instances (giantswarm-managed).

Removed

  • Removed FluentbitTooManyErrors alerts, as this is already covered by FluentbitDropRatio alerts and they mostly page together.

v4.66.0

24 Jun 07:12
6103d6f

Added

  • Added cancel_if_metrics_broken inhibition to the following alerts:
    • ManagementClusterDeploymentMissingCAPA
    • ManagementClusterDeploymentMissingCAPI
    • ETCDBackupMetricsMissing
    • PrometheusMissingGrafanaCloud
    • MimirToGrafanaCloudExporterDown
    • ManagementClusterDexAppMissing
  • Add CiliumAgentPodPending alert for Cabbage.

Changed

  • LogForwardingErrors description improvement

v4.65.1

16 Jun 08:00
3978b31

Changed

  • Increase MimirIngesterNeedsToBeScaledUp alert's time to trigger from 6h to 12h to avoid noise coming from temporary spikes.
  • WorkloadClusterWebhookDurationExceedsTimeoutSolutionEngineers alert: page only during business hours, and increase the delay before it pages to 1h.
  • MetricForwardingErrors alert: make it less sensitive.

v4.65.0

10 Jun 09:13
1d9f5b8

Changed

  • Improved ClusterAutoscalerFailedScaling alert expression to reduce false positives by detecting ongoing scaling failures rather than cumulative historical failures.

v4.64.0

05 Jun 10:34
96da361

Changed

  • Removed grafana from DeploymentNotSatisfiedAtlas because it's already monitored via GrafanaDown alert.
  • Rework Rocket's ManagementClusterContainerIsRestartingTooFrequently to use pod names as the selector.
  • Update alert for Cilium HelmRelease to match timeout.

v4.63.0

02 Jun 08:41
a9be467

Added

  • Add IncorrectResourceUsageData alert.

Changed

  • Made MimirIngesterNeedsToBeScaledUp alert less sensitive to CPU usage.
  • Increase MimirIngesterNeedsToBeScaledUp alert's time to trigger from 1h to 6h to avoid noise from temporary spikes, such as those caused by stable-testing installations (giantswarm/giantswarm#33513).
  • Rewrite Flux alerting rules to use the gotk_resource_info metric emitted by Kube State Metrics.
  • Drop customer-related alerting rules of Flux.
  • Rules unit tests: support for $provider template so we can move provider-specific tests to global tests.
  • Rules unit tests: simplify file organization by removing the capi folder. Also fixes a bug in cloud-director tests.
  • Rules linting: run against all configured providers.
  • Exclude more containers from Rocket's ManagementClusterContainerIsRestartingTooFrequently alert.

v4.62.0

15 May 08:30
4f16aea

Added

  • Add AppAdmissionControllerWebhookDurationExceedsTimeout alert, business hours only.

Removed

  • Remove app-admission-controller from generic ManagementClusterWebhookDurationExceedsTimeout alert.

Changed

  • Remove duplicate test files for Atlas since all tests are the same across all CAPI providers.
  • Remove duplicate test files for Honeybadger since all tests are the same across all CAPI providers.
  • Remove duplicate test files for Shield since all tests are the same across all CAPI providers.
  • Remove duplicate test files for Tenet since all tests are the same across all CAPI providers.