We've been running consul-esm for a long time, but with `critical_threshold` and `passing_threshold` both set to the default of 0. This means that after a single failed healthcheck, the service will be marked as critical in Consul.
To better deal with transient failures, we'd like to increase the number of failed healthchecks required before Consul is notified that the service is critical. From the README, we expected `critical_threshold` to control this value:
> To prevent flapping, thresholds for updating a check status can be configured by `passing_threshold` and `critical_threshold` such that a check will update and switch to be passing / critical after an additional number of consecutive or non-consecutive checks.
The issue is that there are two separate ways that consul-esm counts failures, both configured by `critical_threshold`. The first is actually in the consul agent dependency, which consul-esm configures here: https://github.com/hashicorp/consul-esm/blob/main/check.go#L143-L144
The logic for this function is here: https://github.com/hashicorp/consul/blob/main/agent/checks/check.go#L1212-L1253
The StatusHandler `updateCheck` function tracks the actual healthcheck successes and failures against the thresholds set by `critical_threshold`. This is what we would expect the setting to configure. However, this accounting is only internal; it does not control what consul-esm reports to Consul.
The second way that consul-esm counts failures is downstream of the first, in CheckRunner `UpdateCheck`: https://github.com/hashicorp/consul-esm/blob/main/check.go#L352-L413
This also uses the `critical_threshold` and `passing_threshold` values, but its input is the number of times the check has failed only after it has first crossed the failure threshold in StatusHandler. This second counter has additional "consecutive" logic that decrements it on success. Once this function finally counts a check as critical, it updates the output to Consul as critical.
This double-accounting has a few implications:
- It takes at least 2x `critical_threshold` failed healthchecks to get a service marked as critical in Consul, and often many more than that (see examples below).
- As an operator, I do not have the expected control over the number of failures required before a service is marked as critical.
- Because the setting changes an internal counter between the StatusHandler and the output, I have to inspect logs of internal consul-esm state to understand which healthcheck failures led to the observed Consul status.
I also found the docs misleading:
> For an example of how non-consecutive checks are counted, we have a check that has the status 'passing', `critical_threshold`=3, and the counter is at 0 (c=0). The following pattern of pass/fail will decrement/increment the counter as such:
> PASS (c=0), FAIL (c=1), FAIL (c=2), PASS (c=1), FAIL (c=2), FAIL (c=3), PASS (c=2), FAIL (c=3), FAIL (c=4)
> When the counter reaches 4 (1 initial fail + 3 additional fails), the `critical_threshold` is met and the check status will update to 'critical' and the counter will reset.
> Note: this implementation diverges from Consul's anti-flapping thresholds, which counts total consecutive checks.
While this is a correct description of how the internal second-level accounting works, the passes and failures counted are internal and are not the inputs to the system (the healthchecks themselves). If you set `critical_threshold = 3` and your service responds with the (p)ass/(f)ail pattern `p,f,f,p,f,f,p,f,f`, consul-esm will never mark the service as critical in Consul. You need to provide four consecutive failures broken up by a success (`p,f,f,f,f` repeating), and even then it takes 15 healthchecks before the service is finally marked as critical in Consul, before reverting back to passing.
We end up with something like this, which is not documented and hard to reason about:

| `critical_threshold` setting | Number of healthcheck failures before critical in Consul |
|---|---|
| 0 | 1 |
| 1 | 2 - 3 |
| 2 | 4 - 8 |
| 3 | 8 - 15 |
My recommendation is to remove the second level of accounting entirely. While it may be a breaking change, the current setting behavior is not accurately described in the documentation, and is not granular enough to let operators make useful changes to the behavior. Passing through the output of the StatusHandler directly makes the Consul output behave as expected.
**Reproducer and details**
Create a dummy service that will respond to health checks in a pattern we define:
```
% ./healthy --pattern=p,f,f,f
2025/07/07 18:22:42 Listening at :8000/consul
2025/07/07 18:22:42 Using pattern: [p f f f]
2025/07/07 17:25:24 Returning 200
2025/07/07 17:25:39 Returning 500
2025/07/07 17:25:54 Returning 500
2025/07/07 17:26:09 Returning 500
2025/07/07 17:26:24 Returning 200
2025/07/07 17:26:39 Returning 500
2025/07/07 17:26:54 Returning 500
2025/07/07 17:27:09 Returning 500
```
And test the Consul service status by probing the Consul instance directly after each response:

```
curl -X GET -ksS "http://localhost:9500/v1/health/service/test-healthy" | jq '.[] | .Checks[] .Status'
```
Note that Consul recommends starting checks as critical so that services must pass healthchecks before receiving traffic; this setting does not affect the outcome.
Run Consul with default settings, and register a service with HTTP healthchecks.
Run consul-esm with a basic config, using the (current) latest version v0.9.0:

```hcl
datacenter = "dc"
http_addr = "localhost:9500"
log_level = "INFO"
passing_threshold = 0
critical_threshold = 3 # <-- matches the README value
enable_debug = true
client_address = "localhost:9999"
```
With `critical_threshold` = 0 (default), the output flaps as expected:
pattern: p,f
| service response | consul-esm log | consul api status |
|------------------|-----------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
With `critical_threshold` = 1, the check now stays passing:
| service response | consul-esm log | consul api status |
|------------------|----------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
With a second consecutive failure, `critical_threshold` = 1 requires two failures; this matches expectations:
pattern: p,f,f
| service response | consul-esm log | consul api status |
|------------------|----------------------------------------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
The problems start when we go higher. With `critical_threshold` = 2, we expect the check to be marked critical after the third failure, but it takes 8 healthchecks instead:
pattern: p,f,f,f
| service response | consul-esm log | consul api status |
|------------------|----------------------------------------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check failed but has not reached failure threshold | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 200 | Check failed but has not reached failure threshold | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
Adding the internal second-level accounting (with its decrement logic) shows why this happens: `c` is only incremented for a failure after StatusHandler has first crossed its own threshold and marked the status as critical.
| service response | consul-esm log | consul api status | internal consul-esm accounting |
|------------------|----------------------------------------------------|-------------------|--------------------------------|
| Returning 200 | | "passing" | c = 0 |
| Returning 500 | Check failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 200 | Check failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |
More examples with `critical_threshold` = 2; here it takes 4 failed healthchecks:
| service response | consul-esm log | consul api status | internal consul-esm accounting |
|------------------|----------------------------------------------------|-------------------|--------------------------------|
| Returning 200 | | "passing" | c = 0 |
| Returning 500 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |
| Returning 200 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |