-
Notifications
You must be signed in to change notification settings - Fork 30
Add prometheus wpa controller reconcile and wpa valid metrics #142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
controllers/metrics.go
Outdated
| metricNamePromLabel = "metric_name" | ||
| reasonPromLabel = "reason" | ||
| transitionPromLabel = "transition" | ||
| reconcileErrPromLabel = "reconcile_err" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it can be only error or type since the metrics name already contains "error"
controllers/metrics.go
Outdated
| var reasonValues = []string{downscaleCappingPromLabelVal, upscaleCappingPromLabelVal, withinBoundsPromLabelVal} | ||
|
|
||
| // reconcileReasonValues contains possible `reconcile_err` label values | ||
| var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be more self explained, the slice name could include "error"
| var reconcileReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} | |
| var reconcileErrorReasonValues = []string{nullPromLabelVal, scaleNotFoundPromLabelVal, invalidAPIVersionPromLabelVal, unknownResourcePromLabelVal, failedUpdateReplicasPromLabelVal, failedComputeReplicasPromLabelVal, failedScalePromLabelVal} |
controllers/metrics.go
Outdated
| resourceNamespacePromLabel, | ||
| resourceNamePromLabel, | ||
| resourceKindPromLabel, | ||
| reconcileErrPromLabel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the "reconcile_success" should not include a "error" tag.
maybe we can inverse the metrics logic: reconcile_error and set to 1 if an error happen.
or create 2 metrics:
reconcile_successwithoutreconcileErrPromLabelreconcile_errorsend only in case of error with the error type.
like this a user can monitor WPA withreconcile_successand understand the issue with the metrics "reconcile_error".
@CharlyF and @celenechang WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also WPA Invalid can be a reconcile_error error_type. so maybe we don't need the metric wpa_valid.
3f999fc to
43edfde
Compare
What does this PR do?
Adds 2 metrics:
wpa_controller_reconcile_error:1with tagreason:<short_error_message>if the last reconcile results in an error. If there is no error, no metric is reported.wpa_controller_reconcile_success: Reports1if the last reconcile is successful,0if there was an error.Example:
Motivation
More ways to track WPA and WPA controller errors. There's a
controller_runtime_reconcile_totalmetric, but that only has the labelscontrollerandresult, which don't give much detail about the reconcile error.Describe your test plan
Set up a WPA(s) (example). The WPA should be valid, the target resource should be present, and the Datadog metric should be present and reporting consistently. The
wpa_controller_reconcile_successmetric should be present with value1and the following labels:resource_kindresource_nameresource_namespacewpa_namewpa_namespaceTo visualize metrics, either collect them via the node agent with the
prometheusoropenmetricscheck (example), or check the/metricsendpoint:/metricsendpoint:kubectl exec -it <wpa_controller> -- curl localhost:8383/metricsUpdate the WPA (and/or target resource) to force an error (examples below) and ensure that the
wpa_controller_reconcile_successmetric reports0and that thewpa_controller_reconcile_errormetric reports1with the appropriatereasontag. There shouldn't be any stale metrics; the metrics should update accordingly when going from an error to an ok state, error to another error state, and when the WPA is deleted.This isn't inclusive of all possible errors (and
reasonvalues), but here's a list of a few ways to force some errors:system.load.1.invalid. This should give metrics withreason:failed_compute_replicasandFailed to compute desired number of replicas based on listed metrics.logs in the controller pod.spec.scaleTargetRef.apiVersionto getreason:invalid_api_version:spec.scaleTargetRef.apiVersion. For example,extensions/v1beta1for Deployments looks to have been deprecated in v1.16 so on a newer Kubernetes cluster with the target Deployment using theapps/v1apiVersion, using the old apiVersion results inreason:unknown_resourceand the log lineunable to determine resource for scale target reference:spec.minReplicaslarger than thespec.maxReplicasto show the log messageInvalid WPA specification: watermark pod autoscaler requires the minimum number of replicas to be configured and inferior to the maximumand the tagreason:invalid_wpa_spec: