@@ -303,7 +303,7 @@ spec:
destination_workload:{{ target }},
!response_code:404
}.as_count()
- /
+ /
sum:istio.mesh.request.count{
reporter:destination,
destination_workload_namespace:{{ namespace }},
@@ -326,6 +326,61 @@ Reference the template in the canary analysis:
        interval: 1m
```

+ ### Common Pitfalls for Datadog
+
+ The following examples use an ingress-nginx ReplicaSet of three web servers, and a client that constantly performs
+ approximately 5 requests per second. Each of the three nginx web servers reports its metrics every 15 seconds.
+
+ #### Pitfall 1: Converting metrics to rates (using `.as_rate()`) can have high sampling noise
+
+ Example query `sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate()`
+ for the past 5 minutes.
+
+ 
+
+ Datadog automatically rolls up (up/downsamples) a timeseries, with a time resolution based on the requested
+ interval: the longer the interval, the coarser the time resolution, and vice versa. For short intervals, the time
+ resolution of the query response can therefore be higher than the reporting rate of the app, leading to a spiky
+ rate graph that oscillates erratically and is nowhere near the real rate.
+
+ This is amplified even further when applying e.g. `default_zero()`, which makes Datadog insert a zero for every
+ empty time interval in the response.
+
+ Example query `default_zero(sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate())`
+ for the past 5 minutes.
+
+ 
+
+ To overcome this, you should manually apply a `rollup()` to your query, aggregating over at least one complete
+ reporting interval of your application (in this case: 15 seconds).
+
+ #### Pitfall 2: Datadog metrics tend to return incomplete (and thus usually too small) values for the most recent time intervals
+
+ Example query: `sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress} by {env}.as_rate().rollup(15)`
+
+ 
+
+ The rightmost bar displays a smaller value, because not all targets contributing to the metric have reported
+ for the most recent time interval yet. In extreme cases, the value will be zero. As time goes by, this bar fills
+ up, but the most recent bar(s) are almost always incomplete. The Datadog UI sometimes shades the last bucket as
+ incomplete, but this "incomplete data" information is not part of the returned time series, so Flagger
+ cannot know which samples to trust.
+
+ #### Recommendations for Datadog metric evaluations
+
+ Flagger queries Datadog for the interval between `analysis.metrics.interval` ago and `now`, and
+ then (since release (TODO: unreleased)) takes the **first** sample of the result set. It cannot take the
+ last one, because recent samples might be incomplete. So, for an interval of e.g. `2m`, Flagger evaluates
+ the value from 2 minutes ago.
+
+ - In order to get a result that does not oscillate, apply a rollup of at least the reporting interval of
+   the observed target.
+ - In order to get a recent result, use a small interval, but...
+ - In order to get a complete result, use a query interval that contains at least one full rollup window.
+   This should be the case if the interval is at least twice the rollup window.
+ - In order to always get a metric result, you can apply functions like `default_zero()`, but you must
+   make sure that receiving a zero does not fail your evaluation (see the sketch below).
+
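+ Putting these recommendations together, the following is a minimal sketch of a metric template and its analysis
+ reference, assuming the 15-second reporting interval from the examples above; the metric name, tags, thresholds
+ and the `datadog` secret name are illustrative, not prescriptive:
+
+ ```yaml
+ apiVersion: flagger.app/v1beta1
+ kind: MetricTemplate
+ metadata:
+   name: request-rate
+   namespace: istio-system
+ spec:
+   provider:
+     type: datadog
+     secretRef:
+       name: datadog
+   # rollup(15) aggregates one full reporting interval of the target;
+   # default_zero() guarantees a result even for empty intervals
+   query: |
+     default_zero(
+       sum:nginx_ingress.controller.requests{env:development, ingress:my-ingress}
+       .as_rate().rollup(15)
+     )
+ ```
+
+ ```yaml
+   analysis:
+     metrics:
+       - name: request-rate
+         templateRef:
+           name: request-rate
+           namespace: istio-system
+         # 1m is at least twice the 15s rollup window, so the queried
+         # range always contains one complete rollup bucket
+         interval: 1m
+         thresholdRange:
+           # a default_zero() result of 0 must not fail the evaluation
+           min: 0
+ ```
+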
## Amazon CloudWatch

You can create custom metric checks using the CloudWatch metrics provider.
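
As a rough orientation, here is a minimal sketch of such a template, assuming the provider takes a region and a
CloudWatch `GetMetricData` query expressed as a JSON array; the region, metric namespace, metric names and
dimensions below are illustrative only:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-error-rate
  namespace: flagger
spec:
  provider:
    type: cloudwatch
    # region of the CloudWatch metrics (illustrative)
    region: eu-central-1
  query: |
    [
      {
        "Id": "errorRate",
        "Expression": "errors / requests * 100",
        "Label": "ErrorRate",
        "ReturnData": true
      },
      {
        "Id": "errors",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyApp",
            "MetricName": "ErrorCount",
            "Dimensions": [
              { "Name": "Service", "Value": "{{ name }}" }
            ]
          },
          "Period": 60,
          "Stat": "Sum"
        },
        "ReturnData": false
      },
      {
        "Id": "requests",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyApp",
            "MetricName": "RequestCount",
            "Dimensions": [
              { "Name": "Service", "Value": "{{ name }}" }
            ]
          },
          "Period": 60,
          "Stat": "Sum"
        },
        "ReturnData": false
      }
    ]
```
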
@@ -438,11 +493,11 @@ spec:
      secretRef:
        name: newrelic
      query: |
-       SELECT
-           filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
+       SELECT
+           filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
            sum(nginx_ingress_controller_requests) * 100
-       FROM Metric
-       WHERE metricName = 'nginx_ingress_controller_requests'
+       FROM Metric
+       WHERE metricName = 'nginx_ingress_controller_requests'
        AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
```

@@ -538,7 +593,7 @@ spec:
## Google Cloud Monitoring (Stackdriver)

Enable Workload Identity on your cluster, create a service account key that has read access to the
- Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
+ Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
service account on Kubernetes. You can take a look at this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity)

Annotate the flagger service account
@@ -557,7 +612,7 @@ your [service account json](https://cloud.google.com/docs/authentication/product
kubectl create secret generic gcloud-sa --from-literal=project=<project-id>
```

- Then reference the secret in the metric template.
+ Then reference the secret in the metric template.
Note: The particular MQL query used here works if [Istio is installed on GKE](https://cloud.google.com/istio/docs/istio-on-gke/installing).
```yaml
apiVersion: flagger.app/v1beta1
@@ -568,7 +623,7 @@ metadata:
spec:
  provider:
    type: stackdriver
-   secretRef:
+   secretRef:
      name: gcloud-sa
  query: |
    fetch k8s_container
@@ -725,7 +780,7 @@ This will usually be set to the same value as the analysis interval of a `Canary
  Only relevant if the `type` is set to `analysis`.
* **arguments (optional)**: Arguments to be passed to an `Analysis`.
  Arguments are passed as a list of key-value pairs, separated by `;` characters,
- e.g. `foo=bar;bar=foo`.
+ e.g. `foo=bar;bar=foo`.
  Only relevant if the `type` is set to `analysis`.

For the type `analysis`, the value returned by the provider is either `0`