We've been running consul-esm for a long time, but with `critical_threshold` and `passing_threshold` both set to the default of 0. This means that after a single failed healthcheck, the service will be marked as critical in Consul.
To better deal with transient failures, we'd like to increase the number of failed healthchecks required before Consul is notified that the service is critical. From the README, we expected `critical_threshold` to control this value:
> To prevent flapping, thresholds for updating a check status can be configured by `passing_threshold` and `critical_threshold` such that a check will update and switch to be passing / critical after an additional number of consecutive or non-consecutive checks.
The issue is that there are two separate ways that consul-esm counts failures, both configured by `critical_threshold`. The first is actually in the consul agent dependency, which consul-esm configures here: https://github.com/hashicorp/consul-esm/blob/main/check.go#L143-L144
The logic for this function is here: https://github.com/hashicorp/consul/blob/main/agent/checks/check.go#L1212-L1253
The StatusHandler `updateCheck` function tracks the actual healthcheck successes and failures against the thresholds set by `critical_threshold`. This is what we would expect the setting to configure. However, this accounting is only internal; it does not control what consul-esm reports to Consul.
The second way that consul-esm counts failures is downstream of the first, in CheckRunner `UpdateCheck`: https://github.com/hashicorp/consul-esm/blob/main/check.go#L352-L413
This also uses the `critical_threshold` and `passing_threshold` values, but its input is the number of times the check has failed only after it has first crossed the failure threshold in StatusHandler. This second counter has additional "consecutive" logic that decrements it on success. Once this function finally counts a check as critical, it updates the output to Consul as critical.
This double-accounting has a few implications:
- It takes at least 2x `critical_threshold` failed healthchecks to get a service marked as critical in Consul, and often many more than that (see examples below).
- As an operator, I do not have the expected control over the number of failures required before a service is marked as critical.
- Because the setting changes an internal counter between the StatusHandler and the output, I have to inspect logs of internal consul-esm state to understand which healthcheck failures led to the observed Consul status.
I also found the docs misleading:
> For an example of how non-consecutive checks are counted, we have a check that has the status 'passing', `critical_threshold`=3, and the counter is at 0 (c=0). The following pattern of pass/fail will decrement/increment the counter as such:
> PASS (c=0), FAIL (c=1), FAIL (c=2), PASS (c=1), FAIL (c=2), FAIL (c=3), PASS (c=2), FAIL (c=3), FAIL (c=4)
> When the counter reaches 4 (1 initial fail + 3 additional fails), the `critical_threshold` is met and the check status will update to 'critical' and the counter will reset.
> Note: this implementation diverges from Consul's anti-flapping thresholds, which counts total consecutive checks.
While this is a correct description of how the internal second-level accounting works, the passes and failures counted are internal and are not the inputs to the system (the healthchecks themselves). If you set `critical_threshold = 3` and your service responds with the (p)ass/(f)ail pattern `p,f,f,p,f,f,p,f,f`, consul-esm will never mark the service as critical in Consul. You need to provide four consecutive failures broken up by a success (`p,f,f,f,f` repeating), and even then it takes 15 healthchecks before the service is finally marked as critical in Consul, before reverting back to passing.
We end up with something like this, which is not documented and hard to reason about:

| `critical_threshold` setting | Number of healthcheck failures before critical in Consul |
|---|---|
| 0 | 1 |
| 1 | 2 - 3 |
| 2 | 4 - 8 |
| 3 | 8 - 15 |
My recommendation is to remove the second level of accounting entirely. While it may be a breaking change, the current setting behavior is not accurately described in the documentation, and is not granular enough to let operators make useful changes to the behavior. Passing through the output of the StatusHandler directly makes the Consul output behave as expected.
**Reproducer and details**
Create a dummy service that will respond to health checks in a pattern we define:
```
% ./healthy --pattern=p,f,f,f
2025/07/07 18:22:42 Listening at :8000/consul
2025/07/07 18:22:42 Using pattern: [p f f f]
2025/07/07 17:25:24 Returning 200
2025/07/07 17:25:39 Returning 500
2025/07/07 17:25:54 Returning 500
2025/07/07 17:26:09 Returning 500
2025/07/07 17:26:24 Returning 200
2025/07/07 17:26:39 Returning 500
2025/07/07 17:26:54 Returning 500
2025/07/07 17:27:09 Returning 500
```
And test the Consul service status by probing the Consul instance directly after each response:

```
curl -X GET -ksS "http://localhost:9500/v1/health/service/test-healthy" | jq '.[] | .Checks[] .Status'
```
Note that Consul recommends starting checks as critical so that services must pass healthchecks before receiving traffic; this setting does not affect the outcome.
Run Consul with default settings, and register a service with HTTP healthchecks.
Run consul-esm with a basic config, using the (current) latest version v0.9.0:

```hcl
datacenter = "dc"
http_addr = "localhost:9500"
log_level = "INFO"
passing_threshold = 0
critical_threshold = 3 # <-- matches the README value
enable_debug = true
client_address = "localhost:9999"
```
With `critical_threshold` = 0 (default), the output flaps as expected:
pattern: p,f
| service response | consul-esm log | consul api status |
|------------------|-----------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "critical" |
With `critical_threshold` = 1, the check now stays passing:
| service response | consul-esm log | consul api status |
|------------------|----------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
| Returning 200 | | "passing" |
| Returning 500 | | "passing" |
With a second consecutive failure, `critical_threshold` = 1 requires two failures; this matches expectations:
pattern: p,f,f
| service response | consul-esm log | consul api status |
|------------------|----------------------------------------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
| Returning 200 | | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
The problems start when we go higher. With `critical_threshold` = 2, we expect the check to be marked critical after the third failure, but it takes 8 healthchecks instead:
pattern: p,f,f,f
| service response | consul-esm log | consul api status |
|------------------|----------------------------------------------------|-------------------|
| Returning 200 | | "passing" |
| Returning 500 | Check failed but has not reached failure threshold | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 200 | Check failed but has not reached failure threshold | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical | "passing" |
| Returning 500 | Check is now critical + Updating output and status | "critical" |
Adding the internal second-level accounting (with its decrement logic) shows why this happens: `c` is only incremented for a failure after StatusHandler has first crossed its own threshold and marked the status as critical.
| service response | consul-esm log | consul api status | internal consul-esm accounting |
|------------------|----------------------------------------------------|-------------------|--------------------------------|
| Returning 200 | | "passing" | c = 0 |
| Returning 500 | Check failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 200 | Check failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |
More examples with `critical_threshold` = 2; here it takes 4 failed healthchecks:
| service response | consul-esm log | consul api status | internal consul-esm accounting |
|------------------|----------------------------------------------------|-------------------|--------------------------------|
| Returning 200 | | "passing" | c = 0 |
| Returning 500 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |
| Returning 200 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Failed but has not reached failure threshold | "passing" | c = 0 |
| Returning 500 | Check is now critical | "passing" | c = 1 |
| Returning 500 | Check is now critical | "passing" | c = 2 |
| Returning 500 | Check is now critical + Updating output and status | "critical" | c = 3 |