Message Delivery Rate alert rule and overview panel should use `deriv` instead of `rate`
#67
Labels
Type: Bug
The `rate` function is only for use with `COUNTER` metrics (values that only go up or are reset to 0, but never go down) and not `GAUGE` metrics (values that can go up or down). Basically, the `rate` function has special logic to deal with the "reset to zero" case, but that logic produces misleading results when the metric value goes down, because any decrease is treated as a reset.

The `deriv` function should be used for any `GAUGE` type metrics instead of `rate`. It is a similar calculation of the rate of change, but it does not account for "reset to zero" (and it does not break if the metric value goes down).

The Message Delivery Rate alert rule (and panel) is an interesting case, since technically the `cht_messaging_outgoing_total` metric value can never go down (and so is a `COUNTER`). However, our calculations are based on the values recorded for the `failed` and `delivered` labels. The values for these labels are based on data reported to the CHT by the external messaging platforms. While it seems like the number of messages in the `failed` status should never go down, prod data recorded on the Allies Watchdog instance shows that it apparently is possible. So, for these queries where we are filtering `cht_messaging_outgoing_total` by the status label, we should use `deriv` instead of `rate` to properly measure the change.
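
To illustrate the difference, here is a rough Python sketch of the two calculations. It is a simplified model, not the actual Prometheus implementation: `simplified_rate` only mimics the "any drop is a reset" handling and skips the window extrapolation that the real `rate` performs, and `simplified_deriv` is a plain least-squares slope. The function names and the sample values are made up for illustration and are not the data recorded on the Watchdog instance.

```python
def simplified_increase(samples):
    """Counter-style increase: any drop between samples is treated as a reset to zero."""
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Assumed reset: the whole current value is counted as new growth.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


def simplified_rate(samples, step_seconds):
    """Per-second increase over the window (no extrapolation, unlike the real rate)."""
    window = step_seconds * (len(samples) - 1)
    return simplified_increase(samples) / window


def simplified_deriv(samples, step_seconds):
    """Per-second slope from a simple least-squares fit over evenly spaced samples."""
    n = len(samples)
    times = [i * step_seconds for i in range(n)]
    mean_t = sum(times) / n
    mean_v = sum(samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, samples))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var


if __name__ == "__main__":
    # Hypothetical "failed" counts scraped every 60 seconds; the count drops by 2.
    failed = [100, 100, 98, 98]
    print("rate-like :", simplified_rate(failed, 60))
    print("deriv-like:", simplified_deriv(failed, 60))
```

In this sketch the small drop makes the rate-like calculation report a large positive value, because the drop is read as a reset followed by 98 new failures, while the deriv-like calculation reports a small negative slope. For a series that only goes up, the two agree closely.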