Design: Cloud-native alerting via OTel Collector + Managed Prometheus #227

@mikemcdougall

Goal

Keep alerting simple, standards-based, and cloud‑native, while remaining pluggable across AWS/GCP/Azure without running our own Prometheus/Grafana stack.

Direction (minimal + portable)

  • Emit OTLP (metrics/traces/logs) from the app using standard OTEL_* env vars (already supported).
  • OpenTelemetry Collector (DaemonSet or sidecar) receives OTLP and exports:
    • Metrics → managed Prometheus (remote_write)
    • Traces/logs → provider backend (or OTLP‑compatible vendor)
  • PromQL alert rules stored in the repo as standard Prometheus rule-group YAML (the groups/rules file format).
  • Provider adapters apply the same rule set via their managed services.
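
As a rough illustration of the first bullet, the app side can be pointed at the collector with the standard OTEL_* variables in a container spec. This is only a sketch: the collector address (otel-collector:4317) and the service name (honua) are assumptions, not values taken from the repo.

```yaml
# Container env fragment for a Kubernetes pod spec. Values marked "assumed" are hypothetical.
env:
  - name: OTEL_SERVICE_NAME
    value: honua                        # assumed service name
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector:4317   # assumed address of the collector Service (OTLP/gRPC)
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc                         # matches the grpc protocol enabled on the otlp receiver
```

With a DaemonSet collector the endpoint would typically point at the node-local agent instead of a cluster Service; with a sidecar it would be localhost.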

Collector (vendor‑agnostic shape)

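# Receive OTLP from the app over gRPC (default listen port 4317).
receivers:
  otlp:
    protocols:
      grpc:
# Batch telemetry before export to reduce outbound requests.
processors:
  batch:
# Push metrics to the managed Prometheus remote_write endpoint.
# PROM_RW_ENDPOINT is supplied per provider; auth lives in the overlays.
exporters:
  prometheusremotewrite:
    endpoint: ${PROM_RW_ENDPOINT}
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]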
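
The config above covers metrics only. One possible shape for the traces/logs side is a second fragment that forwards both signals over OTLP, merged with the base config (for example by passing more than one --config flag to the collector). The OTLP_BACKEND_ENDPOINT variable is hypothetical, and a provider-native exporter could stand in for the generic otlp exporter.

```yaml
# Overlay fragment: adds trace and log pipelines next to the metrics pipeline.
exporters:
  otlp:
    endpoint: ${OTLP_BACKEND_ENDPOINT}   # hypothetical variable: host:port of an OTLP/gRPC backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```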

Core alert rules (PromQL)

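groups:
- name: honua-core
  rules:
  # Fires when more than 5% of requests returned a 5xx over the last 5 minutes.
  - alert: HonuaHigh5xxRate
    expr: |
      sum(rate(honua_http_request_total{status_code=~"5.."}[5m]))
        / sum(rate(honua_http_request_total[5m])) > 0.05
    for: 5m
  # Fires when p95 request latency exceeds 2000 ms (histogram buckets are in milliseconds).
  - alert: HonuaHighLatencyP95
    expr: |
      histogram_quantile(0.95,
        sum(rate(honua_http_request_duration_ms_bucket[5m])) by (le)) > 2000
    for: 5m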
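
Because the ruleset is plain Prometheus rule YAML, it could be checked in CI with promtool before any provider adapter publishes it. The test file below is a sketch under assumptions: the file name honua-core.test.yaml and the synthetic series are made up for illustration, and it assumes the rules are saved as honua-core.yaml in the same directory.

```yaml
# honua-core.test.yaml (hypothetical name); run with: promtool test rules honua-core.test.yaml
rule_files:
  - honua-core.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Synthetic counters: 10 errors and 90 successes per minute, i.e. a 10% 5xx ratio.
      - series: 'honua_http_request_total{status_code="500"}'
        values: '0+10x20'
      - series: 'honua_http_request_total{status_code="200"}'
        values: '0+90x20'
    alert_rule_test:
      # Early on the condition holds but the 5m "for" window has not elapsed: nothing fires yet.
      - eval_time: 3m
        alertname: HonuaHigh5xxRate
      # Well past the "for" window: one firing alert, with no labels beyond alertname.
      - eval_time: 15m
        alertname: HonuaHigh5xxRate
        exp_alerts:
          - exp_labels: {}
```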

Provider notes (thin overlays)

  • AWS (AMP): remote_write with SigV4; rules uploaded to AMP rule groups; notifications via SNS or managed Grafana.
  • GCP (Managed Prometheus): remote_write with Workload Identity; rules as Cloud Monitoring PromQL policies.
  • Azure (Azure Monitor Prometheus): remote_write with Managed Identity; rules via Prometheus rule groups + Action Groups.
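
To make the "thin overlay" idea concrete, here is a sketch of what the AWS overlay might look like when merged with the vendor-agnostic collector config. It assumes a collector distribution that ships the sigv4auth extension (such as the contrib build), and the region is a placeholder. The GCP and Azure overlays would follow the same pattern with their own auth mechanisms.

```yaml
# AWS overlay fragment: only auth is added; the base pipeline is unchanged.
extensions:
  sigv4auth:
    region: us-east-1        # placeholder region, replace with the AMP workspace region
    service: aps             # SigV4 service name for Amazon Managed Service for Prometheus
exporters:
  prometheusremotewrite:
    auth:
      authenticator: sigv4auth   # sign remote_write requests with SigV4
service:
  extensions: [sigv4auth]
```

PROM_RW_ENDPOINT would then be set to the workspace's remote_write URL, keeping the endpoint itself out of the overlay.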

Tasks

  • Add docs/alerting/README.md describing the OTLP→collector→managed‑Prometheus flow.
  • Add docs/alerting/rules/honua-core.yaml (PromQL ruleset).
  • Add provider overlays/examples (docs/alerting/aws|gcp|azure) showing auth + rule publishing.

Acceptance criteria

  • Single ruleset used across clouds.
  • No self‑hosted Prometheus/Grafana required.
  • Collector config is small and vendor‑agnostic; provider auth only in overlays.
