(OSD-25580) Ship Network Live Migration Metrics to Telemetry #2258

Open: wants to merge 10 commits into base: master
@@ -0,0 +1,53 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: sre-network-live-migration
    role: alert-rules
  name: sre-network-live-migration
  namespace: openshift-monitoring
spec:
  groups:
    - name: sre-network-live-migration
      rules:
        - expr:
            (
              max(
                max_over_time(
                  timestamp(
                    openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"} == 1
                  )[5d:]
                )
              )
              -
              min(
                min_over_time(
                  timestamp(
                    openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"} == 1
                  )[5d:]
                )
              )
            )
            and on()
            openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"} == 1
          record: cluster:usage:network_live_migration_duration
        - expr: openshift_network_operator_live_migration_blocked
          record: cluster:usage:network_live_migration_blocked
    - name: sre-network-live-migration-alerts
      rules:
        - alert: NetworkMigrationDelayedSRE
          expr: (cluster:usage:network_live_migration_duration > (sum(cluster:node_roles) * 1800))
          for: 1m
          labels:
            severity: critical
            namespace: openshift-network-operator
          annotations:
            message: Live migration from SDN to OVN is taking much longer than expected to complete.
        - alert: NetworkMigrationBlocked
          expr: cluster:usage:network_live_migration_blocked == 1
          for: 5m
          labels:
            severity: warning
            namespace: openshift-network-operator
          annotations:
            message: Network operator cannot start the requested migration from SDN to OVN due to {{ $labels.reason }}.
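Review note: to make the intent of the duration rule and the `NetworkMigrationDelayedSRE` threshold easier to check, here is a rough Python sketch of what the expressions appear to compute. It is a simplification and not how Prometheus evaluates the rule: it ignores staleness handling and subquery step alignment, and the names `migration_duration`, `is_delayed`, `WINDOW_SECONDS`, and `SECONDS_PER_NODE` are hypothetical. `node_count` stands in for `sum(cluster:node_roles)`.

```python
from typing import Optional, Sequence, Tuple

# (unix timestamp in seconds, gauge value) for the InProgress condition series
Sample = Tuple[float, float]

WINDOW_SECONDS = 5 * 24 * 3600  # mirrors the [5d:] subquery range
SECONDS_PER_NODE = 1800         # mirrors the "* 1800" factor in the alert


def migration_duration(samples: Sequence[Sample], now: float) -> Optional[float]:
    """Approximate the recording rule: latest minus earliest in-progress timestamp."""
    window = [(t, v) for t, v in samples if now - WINDOW_SECONDS <= t <= now]
    if not window:
        return None
    # Emulates `and on() ... == 1`: no result unless a migration is in progress now.
    latest = max(window, key=lambda sample: sample[0])
    if latest[1] != 1:
        return None
    in_progress = [t for t, v in window if v == 1]
    return max(in_progress) - min(in_progress)


def is_delayed(duration: Optional[float], node_count: int) -> bool:
    """Approximate the alert condition: duration > node_count * 1800 seconds."""
    return duration is not None and duration > node_count * SECONDS_PER_NODE
```

Under this reading, a 10-node cluster would need a recorded duration above 18000 seconds (5 hours) before the alert condition holds. The sketch also suggests one thing worth confirming: if the condition was 1 earlier in the 5-day window, dropped to 0, and is 1 again now, the earliest timestamp still comes from the first run, so the reported duration would span both.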
40 changes: 40 additions & 0 deletions hack/00-osd-managed-cluster-config-integration.yaml.tmpl
@@ -37293,6 +37293,46 @@ objects:
          namespace: '{{ $labels.namespace }}'
        annotations:
          message: The weekly Velero backup has not successfully completed
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      prometheus: sre-network-live-migration
      role: alert-rules
    name: sre-network-live-migration
    namespace: openshift-monitoring
  spec:
    groups:
    - name: sre-network-live-migration
      rules:
      - expr: ( max( max_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) - min( min_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) ) and on() openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1
        record: cluster:usage:network_live_migration_duration
      - expr: openshift_network_operator_live_migration_blocked
        record: cluster:usage:network_live_migration_blocked
    - name: sre-network-live-migration-alerts
      rules:
      - alert: NetworkMigrationDelayedSRE
        expr: (cluster:usage:network_live_migration_duration > (sum(cluster:node_roles)
          * 1800))
        for: 1m
        labels:
          severity: critical
          namespace: openshift-network-operator
        annotations:
          message: Live migration from SDN to OVN is taking much longer than expected
            to complete.
      - alert: NetworkMigrationBlocked
        expr: cluster:usage:network_live_migration_blocked == 1
        for: 5m
        labels:
          severity: warning
          namespace: openshift-network-operator
        annotations:
          message: Network operator cannot start the requested migration from
            SDN to OVN due to {{ $labels.reason }}.
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
40 changes: 40 additions & 0 deletions hack/00-osd-managed-cluster-config-production.yaml.tmpl
@@ -37293,6 +37293,46 @@ objects:
          namespace: '{{ $labels.namespace }}'
        annotations:
          message: The weekly Velero backup has not successfully completed
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      prometheus: sre-network-live-migration
      role: alert-rules
    name: sre-network-live-migration
    namespace: openshift-monitoring
  spec:
    groups:
    - name: sre-network-live-migration
      rules:
      - expr: ( max( max_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) - min( min_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) ) and on() openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1
        record: cluster:usage:network_live_migration_duration
      - expr: openshift_network_operator_live_migration_blocked
        record: cluster:usage:network_live_migration_blocked
    - name: sre-network-live-migration-alerts
      rules:
      - alert: NetworkMigrationDelayedSRE
        expr: (cluster:usage:network_live_migration_duration > (sum(cluster:node_roles)
          * 1800))
        for: 1m
        labels:
          severity: critical
          namespace: openshift-network-operator
        annotations:
          message: Live migration from SDN to OVN is taking much longer than expected
            to complete.
      - alert: NetworkMigrationBlocked
        expr: cluster:usage:network_live_migration_blocked == 1
        for: 5m
        labels:
          severity: warning
          namespace: openshift-network-operator
        annotations:
          message: Network operator cannot start the requested migration from
            SDN to OVN due to {{ $labels.reason }}.
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
40 changes: 40 additions & 0 deletions hack/00-osd-managed-cluster-config-stage.yaml.tmpl
@@ -37293,6 +37293,46 @@ objects:
          namespace: '{{ $labels.namespace }}'
        annotations:
          message: The weekly Velero backup has not successfully completed
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    labels:
      prometheus: sre-network-live-migration
      role: alert-rules
    name: sre-network-live-migration
    namespace: openshift-monitoring
  spec:
    groups:
    - name: sre-network-live-migration
      rules:
      - expr: ( max( max_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) - min( min_over_time( timestamp( openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1 )[5d:] ) ) ) and on() openshift_network_operator_live_migration_condition{type="NetworkTypeMigrationInProgress"}
          == 1
        record: cluster:usage:network_live_migration_duration
      - expr: openshift_network_operator_live_migration_blocked
        record: cluster:usage:network_live_migration_blocked
    - name: sre-network-live-migration-alerts
      rules:
      - alert: NetworkMigrationDelayedSRE
        expr: (cluster:usage:network_live_migration_duration > (sum(cluster:node_roles)
          * 1800))
        for: 1m
        labels:
          severity: critical
          namespace: openshift-network-operator
        annotations:
          message: Live migration from SDN to OVN is taking much longer than expected
            to complete.
      - alert: NetworkMigrationBlocked
        expr: cluster:usage:network_live_migration_blocked == 1
        for: 5m
        labels:
          severity: warning
          namespace: openshift-network-operator
        annotations:
          message: Network operator cannot start the requested migration from
            SDN to OVN due to {{ $labels.reason }}.
- apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
2 changes: 1 addition & 1 deletion resources/cluster-monitoring-config/config.yaml
@@ -19,7 +19,7 @@ prometheusK8s:
       writeRelabelConfigs:
         - sourceLabels: [__name__]
           action: keep
-          regex: '(addon_operator_addons_count|addon_operator_reconcile_error|addon_operator_addon_health_info|addon_operator_ocm_api_requests_durations|addon_operator_ocm_api_requests_durations_sum|addon_operator_ocm_api_requests_durations_count|addon_operator_paused|cluster_admin_enabled|limited_support_enabled|identity_provider|cpms_enabled|ingress_canary_route_reachable|ocm_agent_service_log_sent_total|sre:slo:probe_success_api|sre:slo:probe_success_console|sre:slo:upgradeoperator_upgrade_result|sre:slo:imageregistry_http_requests_total|sre:slo:oauth_server_requests_total|sre:sla:outage_5_minutes|sre:slo:apiserver_28d_slo|sre:slo:console_28d_slo|sre:error_budget_burn:apiserver_28d_slo|sre:error_budget_burn:console_28d_slo|sre:operators:succeeded|sre:record:upgradeoperator_upgrade_healthcheck_result)'
+          regex: '(addon_operator_addons_count|addon_operator_reconcile_error|addon_operator_addon_health_info|addon_operator_ocm_api_requests_durations|addon_operator_ocm_api_requests_durations_sum|addon_operator_ocm_api_requests_durations_count|addon_operator_paused|cluster_admin_enabled|limited_support_enabled|identity_provider|cpms_enabled|ingress_canary_route_reachable|ocm_agent_service_log_sent_total|sre:slo:probe_success_api|sre:slo:probe_success_console|sre:slo:upgradeoperator_upgrade_result|sre:slo:imageregistry_http_requests_total|sre:slo:oauth_server_requests_total|sre:sla:outage_5_minutes|sre:slo:apiserver_28d_slo|sre:slo:console_28d_slo|sre:error_budget_burn:apiserver_28d_slo|sre:error_budget_burn:console_28d_slo|sre:operators:succeeded|sre:record:upgradeoperator_upgrade_healthcheck_result|cluster:usage:network_live_migration_duration|cluster:usage:network_live_migration_blocked)'
       queueConfig:
         capacity: 2500
         maxShards: 1000
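
Review note: the `keep` relabel action only forwards series whose `__name__` fully matches the allowlist, so the recorded series names have to match the regex entries exactly. The sketch below is one way to sanity-check that locally. It uses an abbreviated allowlist (a few entries, not the full regex from `config.yaml`), and the helper names `is_kept` and `is_valid_metric_name` are hypothetical. It relies on two Prometheus behaviours: relabel regexes are anchored at both ends, and metric names traditionally follow `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```python
import re

# Abbreviated allowlist for illustration; the real regex has many more entries.
KEEP_REGEX = (
    "(addon_operator_paused|sre:operators:succeeded"
    "|cluster:usage:network_live_migration_duration"
    "|cluster:usage:network_live_migration_blocked)"
)

# Classic Prometheus metric-name pattern; colons are reserved for recording rules.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_metric_name(name: str) -> bool:
    """Return True if the name fits the classic metric-name character set."""
    return METRIC_NAME.fullmatch(name) is not None


def is_kept(name: str, regex: str = KEEP_REGEX) -> bool:
    """Emulate `action: keep` on __name__; fullmatch mirrors the anchored regex."""
    return re.fullmatch(regex, name) is not None


for candidate in (
    "cluster:usage:network_live_migration_duration",
    "cluster:usage:network_live_migration_blocked",
    "cluster:usage:network-live-migration:duration",
):
    print(candidate, is_valid_metric_name(candidate), is_kept(candidate))
```

A hyphenated variant such as `cluster:usage:network-live-migration:duration` fails both checks, which is why the recording rule names and the allowlist entries need to stay in the same underscore form.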