Description
Component(s)
exporter/otlphttp
What happened?
Describe the bug
According to the OpenTelemetry specification, only a specific set of response codes should be considered retryable (see: https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#retryable-response-codes). However, when the retry_on_failure configuration option is enabled (enabled=true, which is the default), the collector appears to retry HTTP 400 responses as well.
We observe this behavior on collectors that write to our Mimir via exporter/otlphttp: Mimir responds with 400 errors due to “metric too old,” and the requests that failed with these non-retryable errors are kept in the local queues, which should not happen. This leads to queue growth and, probably, unnecessary retries.
I would expect the collector to automatically discard non-retryable errors, regardless of the retry_on_failure setting.
Steps to reproduce
- Configure an exporter with retry_on_failure.enabled = true (the default).
- Send metrics that Mimir rejects with a 400 “metric too old” error.
- Observe that the collector enqueues and retries these requests instead of discarding them.
What did you expect to see?
Non-retryable errors (e.g., 400 for invalid or too-old data) should be dropped immediately and not queued or retried.
What did you see instead?
The collector apparently treats 400 errors as retryable: it stores the failed requests in the local queue and retries them indefinitely, contrary to the OpenTelemetry specification.
Collector version
v0.137.0
Environment information
Environment
OS: Alma Linux 9
OpenTelemetry Collector configuration
Log output
Additional context
No response