
Commit 0436498

add recommendations for Datadog metrics
1 parent 8276bfa commit 0436498

4 files changed: +64 -9 lines changed


docs/gitbook/usage/metrics.md

+64 -9
@@ -303,7 +303,7 @@ spec:
      destination_workload:{{ target }},
      !response_code:404
    }.as_count()
    /
    sum:istio.mesh.request.count{
      reporter:destination,
      destination_workload_namespace:{{ namespace }},
@@ -326,6 +326,61 @@ Reference the template in the canary analysis:
      interval: 1m
```

### Common Pitfalls for Datadog

The following examples use an ingress-nginx ReplicaSet of three web servers and a client that constantly performs
approximately 5 requests per second. Each of the three nginx web servers reports its metrics every 15 seconds.

#### Pitfall 1: Converting metrics to rates (using `.as_rate()`) can have high sampling noise

Example query, for the past 5 minutes:
`sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate()`

![Sample Datadog time series showing high sampling noise](./images/datadog-sampling-noise.png)

Datadog does an automatic rollup (up-/downsampling) of a time series, and the time resolution is based on the
requested interval: the longer the interval, the coarser the time resolution, and vice versa. This means that, for short
intervals, the time resolution of the query response can be higher than the reporting rate of the app, leading to
a spiky rate graph that oscillates erratically and is nowhere near the real rate.

This effect is amplified even further when applying e.g. `default_zero()`, where Datadog inserts zeros for every empty
time interval in the response.

Example query, for the past 5 minutes:
`default_zero(sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate())`

![Sample Datadog time series showing incorrect zeros inserted](./images/datadog-default-zero.png)

To overcome this, you should manually apply a `rollup()` to your query, aggregating over at least one complete reporting
interval of your application (in this case: 15 seconds).
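
To make this concrete, here is a minimal sketch of a Datadog `MetricTemplate` that applies such a rollup to the
nginx-ingress query used in these examples. This is an illustration, not part of the upstream docs: the template
name, namespace, and Datadog address are placeholder assumptions.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-rate        # illustrative name
  namespace: ingress-nginx  # illustrative namespace
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com  # assumed Datadog site
    secretRef:
      name: datadog
  query: |
    sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate().rollup(15)
```

The `rollup(15)` matches the 15-second reporting interval of the nginx pods, so each returned sample aggregates
at least one complete reporting interval.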

#### Pitfall 2: Datadog metrics tend to return incomplete (and thus usually too small) values for the most recent time intervals

Example query:
`sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate().rollup(15)`

![Sample Datadog time series showing incomplete recent samples](./images/datadog-recent-samples.png)

The rightmost bar displays a smaller value because not all targets contributing to the metric have reported
the most recent time interval yet. In extreme cases, the value will be zero. As time goes by, this bar fills up,
but the most recent bar(s) are almost always incomplete. Sometimes the Datadog UI shades the last bucket in the
example as incomplete, but this "incomplete data" information is not part of the returned time series, so Flagger
cannot know which samples to trust.

#### Recommendations on Datadog metrics evaluations

Flagger queries Datadog for the interval between `analysis.metrics.interval` ago and `now`, and
then (since release (TODO: unreleased)) takes the **first** sample of the result set. It cannot take the
last one, because recent samples might be incomplete. So, for an interval of e.g. `2m`, Flagger evaluates
the value from 2 minutes ago.

- In order to have a result that is not oscillating, you should apply a rollup of at least the reporting interval of
  the observed target.
- In order to have a recent result, you should use a small interval, but...
- In order to have a complete result, you must use a query interval that contains at least one full rollup window.
  This should be the case if the interval is at least twice the rollup window (see the sketch below).
- In order to always have a metric result, you can apply functions like `default_zero()`, but you must
  make sure that receiving a zero does not fail your evaluation.
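Putting these recommendations together, the following sketch shows how a canary analysis entry could reference a
template like the one above. The names and the threshold are illustrative assumptions, not part of the upstream docs:
the query rolls up over 15 seconds (one full reporting interval of the nginx pods), and the 1-minute query interval
is more than twice that rollup window.

```yaml
  analysis:
    metrics:
      - name: request-rate
        templateRef:
          name: request-rate        # the MetricTemplate sketched above
          namespace: ingress-nginx  # illustrative namespace
        thresholdRange:
          min: 1                    # assumed threshold: a rate below 1 req/s fails the check
        interval: 1m                # at least twice the 15s rollup window
```

If the query is additionally wrapped in `default_zero()`, keep in mind that a zero sample would fail the `min: 1`
check above; relax or drop the lower bound if an occasional empty interval should not abort the analysis.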

## Amazon CloudWatch

You can create custom metric checks using the CloudWatch metrics provider.
@@ -438,11 +493,11 @@ spec:
    secretRef:
      name: newrelic
  query: |
    SELECT
      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
      sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
      AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
```

@@ -538,7 +593,7 @@ spec:
## Google Cloud Monitoring (Stackdriver)

Enable Workload Identity on your cluster, create a service account key that has read access to the
Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
service account on Kubernetes. You can take a look at this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity)

Annotate the flagger service account
@@ -557,7 +612,7 @@ your [service account json](https://cloud.google.com/docs/authentication/product
kubectl create secret generic gcloud-sa --from-literal=project=<project-id>
```

Then reference the secret in the metric template.
Note: The particular MQL query used here works if [Istio is installed on GKE](https://cloud.google.com/istio/docs/istio-on-gke/installing).
```yaml
apiVersion: flagger.app/v1beta1
@@ -568,7 +623,7 @@ metadata:
spec:
  provider:
    type: stackdriver
    secretRef:
      name: gcloud-sa
  query: |
    fetch k8s_container
@@ -725,7 +780,7 @@ This will usually be set to the same value as the analysis interval of a `Canary
  Only relevant if the `type` is set to `analysis`.
* **arguments (optional)**: Arguments to be passed to an `Analysis`.
  Arguments are passed as a list of key value pairs, separated by `;` characters,
  e.g. `foo=bar;bar=foo`.
  Only relevant if the `type` is set to `analysis`.

For the type `analysis`, the value returned by the provider is either `0`
