Clean up / document metrics monitor fields (#39413)
Document the semantics of many metrics monitoring variables, and rename some metrics APIs to more clearly indicate their function. As a side effect, fix several metrics reporting / publishing bugs in the Elasticsearch output, including #39146.
Add the `output.events.dead_letter` metric to distinguish events that were ingested to the dead letter index after a fatal error (previously these events were just reported as "acked").
This could have been a shorter fix, but it was hard to test properly because the metrics were updated from two separate functions with many special cases. I ended up reorganizing the Elasticsearch `Publish` helpers to make the logic clearer. The new layout makes it much easier to test the error handling and metrics reporting.
The bugs fixed by this refactor are:
- When a batch was split, the events in it were not reported to the observer via `RetryableErrors`. This caused `activeEvents` to increase permanently even after the events were handled.
- When a previously-failed event was ingested to the dead letter index as a raw string, it was reported to the observer as `Acked` (success). The new logic creates a new `dead_letter` metric specifically for this case.
- When a previously-failed event encountered a fatal error ingesting to the dead letter index:
  * It was reported to the observer as both a permanent error and a retryable error, giving incorrect event counts. It's now reported as just a permanent error.
  * It was added to the retry list anyway, which would block all ingestion if the error really was fatal since the queue could never advance past that event. It's now dropped, the same as with a permanent error in the main index.
- If the Elasticsearch bulk index response was invalid, the associated events were dropped and reported as acknowledged.
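
The accounting rule behind these fixes (every event gets exactly one outcome, and every outcome releases the event from the active count) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `stats`, `outcome`, and `classify` are hypothetical names, and the real observer interface in the Elasticsearch output may differ.

```go
package main

import "fmt"

// outcome is a hypothetical per-event result; each event gets exactly one.
type outcome int

const (
	outcomeAcked      outcome = iota // ingested to the main index
	outcomeDeadLetter                // ingested to the dead letter index
	outcomePermanent                 // fatal error: dropped, never retried
	outcomeRetryable                 // temporary error: sent again later
)

// stats is a hypothetical stand-in for the output observer's counters.
type stats struct {
	active, acked, deadLetter, dropped, retried int
}

// newBatch records n events entering the output.
func (s *stats) newBatch(n int) { s.active += n }

// report records one outcome for one event. Every outcome, including
// a retry, releases the event from the active count. A retried event
// is counted as active again when it is re-sent in a later batch.
func (s *stats) report(o outcome) {
	s.active--
	switch o {
	case outcomeAcked:
		s.acked++
	case outcomeDeadLetter:
		s.deadLetter++
	case outcomePermanent:
		s.dropped++
	case outcomeRetryable:
		s.retried++
	}
}

// classify maps one bulk response item to a single outcome.
// ok: the item was ingested; fatal: the error cannot be retried;
// deadLetterAttempt: the event was already being sent to the dead letter index.
func classify(ok, fatal, deadLetterAttempt bool) outcome {
	switch {
	case ok && deadLetterAttempt:
		return outcomeDeadLetter
	case ok:
		return outcomeAcked
	case fatal:
		// Same handling for the main and dead letter indexes: drop,
		// and do not also report the event as retryable.
		return outcomePermanent
	default:
		return outcomeRetryable
	}
}

func main() {
	s := &stats{}
	s.newBatch(4)
	s.report(classify(true, false, false))  // normal success
	s.report(classify(true, false, true))   // dead letter success
	s.report(classify(false, true, true))   // fatal error on dead letter index
	s.report(classify(false, false, false)) // temporary error
	fmt.Printf("%+v\n", *s)
}
```

Because `classify` returns a single value, an event cannot be counted as both a permanent and a retryable error, and a fatal error on a dead letter attempt is never put back on the retry list.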