Use events as foundation for driving alerts & incident management #4729

@thompson-tomo

What are you trying to achieve?

I want my telemetry signals (metrics, spans & logs) to drive alerting systems such as Alertmanager, with a focus on interoperability and on enabling each alert to be associated with its source. This association is vital for enabling user navigation, among other things.

The key reason for extending events is that the purpose of an event differs from that of an alert, and from that of an incident. The distinction is:

  • An event captures when something happened.
  • An alert informs an audience in real time about something that is happening and that they need to know about.
  • An incident captures that something has occurred which someone needs to look into and potentially address, and for which the organisation's incident management processes need to be followed, including communicating with stakeholders.

Initial implementation thoughts

On log records we introduce a new field, log_record_id, which is used as a unique identifier. This would be for logs what span_id is for spans, and would enable the log.record.uid semconv attribute to be deprecated, just as was done for event.name. The value would be auto-generated if not supplied.
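As a rough illustration of the auto-generation behaviour, here is a minimal sketch. The proposal does not specify the size or encoding of log_record_id; the 8 random bytes rendered as 16 hex characters (mirroring span_id) are an assumption, and LogRecord and new_log_record_id are hypothetical names rather than part of any OpenTelemetry SDK.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


def new_log_record_id() -> str:
    """Return a random identifier as 16 hex characters.

    Assumption: 8 random bytes, mirroring span_id. The proposal does not
    define the actual size or encoding.
    """
    return secrets.token_hex(8)


@dataclass
class LogRecord:
    """Hypothetical, heavily simplified log record."""
    body: str
    event_name: Optional[str] = None
    # Auto-generated when the caller does not supply a value.
    log_record_id: str = field(default_factory=new_log_record_id)


# The id is generated automatically...
auto = LogRecord(body="disk usage above threshold")
# ...unless the caller supplies one.
explicit = LogRecord(body="disk usage above threshold",
                     log_record_id="00f067aa0ba902b7")
```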

We also introduce a log record field, event_kind, which takes the values alert, event & incident, with event being the default when an event name is provided. This is a port of event.kind from ECS. Alternatively, we name it type with the same options.
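The defaulting rule could look something like the sketch below. EventKind and resolve_event_kind are hypothetical names. Leaving the kind unset for a record with neither an event name nor an explicit kind is an assumption, since the proposal only states the default for records that carry an event name.

```python
from enum import Enum
from typing import Optional


class EventKind(Enum):
    ALERT = "alert"
    EVENT = "event"
    INCIDENT = "incident"


def resolve_event_kind(event_name: Optional[str],
                       event_kind: Optional[EventKind] = None) -> Optional[EventKind]:
    """Apply the proposed default: 'event' when an event name is provided."""
    if event_kind is not None:
        return event_kind          # an explicit kind always wins
    if event_name:
        return EventKind.EVENT     # default for named events
    return None                    # assumption: a plain log record has no kind


named = resolve_event_kind("deployment.finished")
alert = resolve_event_kind("cpu.high", EventKind.ALERT)
plain = resolve_event_kind(None)
```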

A log record field, event_state, would enable us to optionally send log records only when the state changes, which would reduce data transmission. This field would only be required when event_kind is alert or incident, and not for any other kind. It would have the options active, created, monitoring & resolved.
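A minimal sketch of both rules, validation and send-on-change, might look like this. The proposal does not say what identifies "the same" alert or incident across records, so the key argument is a placeholder, and validate_state and StateChangeFilter are hypothetical names.

```python
from typing import Dict, Optional

VALID_STATES = {"active", "created", "monitoring", "resolved"}
STATEFUL_KINDS = {"alert", "incident"}


def validate_state(event_kind: Optional[str], event_state: Optional[str]) -> None:
    """Raise ValueError when event_state violates the proposed rules."""
    if event_state is not None and event_state not in VALID_STATES:
        raise ValueError(f"unknown event_state: {event_state!r}")
    if event_kind in STATEFUL_KINDS and event_state is None:
        raise ValueError(f"event_state is required for event_kind={event_kind!r}")


class StateChangeFilter:
    """Lets a record through only when the state for its key has changed."""

    def __init__(self) -> None:
        self._last_state: Dict[str, str] = {}

    def should_send(self, key: str, event_state: str) -> bool:
        # 'key' is a placeholder for whatever identifies the alert or incident.
        if self._last_state.get(key) == event_state:
            return False  # unchanged state: skip transmission
        self._last_state[key] = event_state
        return True


state_filter = StateChangeFilter()
decisions = [
    state_filter.should_send("cpu.high/host-1", "active"),    # first sighting
    state_filter.should_send("cpu.high/host-1", "active"),    # no change
    state_filter.should_send("cpu.high/host-1", "resolved"),  # state changed
]
```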

The metric exemplar object is extended to include the log_record_id field so that exemplars can also provide links to the event.
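To make the linkage concrete, here is a simplified sketch of an exemplar carrying the proposed field. The existing exemplar fields are modelled loosely and with hex strings instead of bytes; log_record_id is the proposed addition, and find_event is a hypothetical lookup helper.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Exemplar:
    """Simplified stand-in for the OTLP metric exemplar."""
    value: float
    time_unix_nano: int
    filtered_attributes: Dict[str, str] = field(default_factory=dict)
    trace_id: Optional[str] = None
    span_id: Optional[str] = None
    log_record_id: Optional[str] = None  # proposed: link to the event's log record


def find_event(exemplar: Exemplar, records_by_id: Dict[str, dict]) -> Optional[dict]:
    """Resolve the event linked from an exemplar, if there is one."""
    if exemplar.log_record_id is None:
        return None
    return records_by_id.get(exemplar.log_record_id)


records_by_id = {
    "00f067aa0ba902b7": {"event_name": "cpu.high", "event_kind": "alert"},
}
linked = Exemplar(value=0.97, time_unix_nano=1_700_000_000_000_000_000,
                  log_record_id="00f067aa0ba902b7")
unlinked = Exemplar(value=0.42, time_unix_nano=1_700_000_000_000_000_000)
```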

Use cases

Metric Alerting

  1. An aggregated metric is processed by a pipeline that compares its value against event thresholds. If necessary, the pipeline creates an event and adds the log_record_id value to an exemplar.
  2. The alert system consumes the OTLP log stream, filtering for event_kind = alert, with other log records dropped. Ideally the filtering would be done in the collector. The event attributes are used by the alerting system to deduplicate the events, etc., to ensure that only necessary alerts are raised and presented to a user.
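The two steps above can be sketched roughly as follows, using plain dictionaries in place of real OTLP messages. Everything here is illustrative: the function names, the attribute keys (metric.name, threshold, value), the generated event name and the deduplication key are assumptions, not something the proposal defines.

```python
import secrets
from typing import List, Optional, Tuple


def evaluate_threshold(metric_name: str, value: float,
                       threshold: float) -> Optional[Tuple[dict, dict]]:
    """Step 1: return (alert_record, exemplar) when the value exceeds the threshold."""
    if value <= threshold:
        return None
    record = {
        "log_record_id": secrets.token_hex(8),  # assumed id format
        "event_name": f"{metric_name}.threshold_exceeded",
        "event_kind": "alert",
        "event_state": "active",
        "attributes": {"metric.name": metric_name,
                       "threshold": threshold,
                       "value": value},
    }
    # The exemplar points back at the event through the proposed field.
    exemplar = {"value": value, "log_record_id": record["log_record_id"]}
    return record, exemplar


def filter_by_kind(records: List[dict], kind: str) -> List[dict]:
    """Step 2a: keep only records of the requested event_kind."""
    return [r for r in records if r.get("event_kind") == kind]


def deduplicate(alerts: List[dict]) -> List[dict]:
    """Step 2b: keep the first alert per (event name, metric name) pair."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (alert["event_name"], alert["attributes"].get("metric.name"))
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


first, first_exemplar = evaluate_threshold("cpu.utilization", 0.95, 0.9)
repeat, _ = evaluate_threshold("cpu.utilization", 0.97, 0.9)
stream = [first, {"event_kind": "event", "event_name": "deploy"},
          repeat, {"body": "plain log"}]
alerts_to_raise = deduplicate(filter_by_kind(stream, "alert"))
```

In a real deployment the filtering step would more likely live in the collector, as the list notes; it is inlined here only to keep the sketch self-contained.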

Direct alerting

This would be the same as metric alerting, except that rather than using metric values to determine whether an alert is needed, the instrumentation would report the alert directly. This is useful when scraping data, e.g. health checks.

Promoting an alert to an incident

  1. An alert can be promoted to an incident either automatically, based on rules in the alerting system, or by a user triggering the escalation via the UI. This escalation results in a new event being raised with event_kind = incident. This event contains the same attributes as the alert, plus an additional attribute, alert.id, for transporting the id of the alert to the incident management system.
  2. The incident management system consumes the OTLP log stream, filtering for event_kind = incident, with other log records dropped. Ideally the filtering would be done in the collector. Once filtered a new incident
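The promotion in step 1 could be sketched as below. Several details are assumptions: that alert.id carries the alert's log_record_id, that a freshly promoted incident starts in the created state, and that the incident reuses the alert's event name. promote_to_incident is a hypothetical helper.

```python
import secrets


def promote_to_incident(alert: dict) -> dict:
    """Build a new incident event from an existing alert event."""
    if alert.get("event_kind") != "alert":
        raise ValueError("only alerts can be promoted to incidents")
    # Copy so the original alert record is left untouched.
    attributes = dict(alert.get("attributes", {}))
    # Assumption: alert.id transports the alert's log_record_id.
    attributes["alert.id"] = alert["log_record_id"]
    return {
        "log_record_id": secrets.token_hex(8),  # the incident gets its own id
        "event_name": alert.get("event_name"),
        "event_kind": "incident",
        "event_state": "created",               # assumed initial state
        "attributes": attributes,
    }


alert = {
    "log_record_id": "00f067aa0ba902b7",
    "event_name": "cpu.high",
    "event_kind": "alert",
    "event_state": "active",
    "attributes": {"host.name": "host-1"},
}
incident = promote_to_incident(alert)
```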

Raising incidents

This would be the same as the previous use case, except that the alert is internal to the system, e.g. CI/CD, hence the incident is raised directly.

Additional context.

It would be great to get input from Prometheus, as they could leverage this to improve Alertmanager and use more OTLP natively.

At the same time, incidents are a topic for CI/CD phase 2: open-telemetry/semantic-conventions#1185

Labels: spec:logs (related to the specification/logs directory)
