Add retry dropped item metrics and an exhausted retry error marker for exporter helper retries #13957

iblancasa · 2025-10-09T14:46:47Z

Description

Add a dedicated experr.NewRetriesExhaustedErr wrapper so exporters can detect when all retry attempts have failed.
Record new otelcol_exporter_retry_dropped_{spans,metric_points,log_records} counters when retries are exhausted, alongside the existing send-failed metrics.
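A minimal sketch of how such a marker error could work, not the code in this PR. The type retriesExhaustedError and the helpers newRetriesExhaustedErr and isRetriesExhaustedErr are hypothetical stand-ins; only the name experr.NewRetriesExhaustedErr comes from the description above, and its real shape may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errDestinationDown is an illustrative cause for the final failed attempt.
var errDestinationDown = errors.New("destination unavailable")

// retriesExhaustedError is a hypothetical marker type. The real experr
// implementation may be shaped differently.
type retriesExhaustedError struct {
	err error
}

func (e retriesExhaustedError) Error() string {
	return "retries exhausted: " + e.err.Error()
}

// Unwrap keeps the original cause reachable through errors.Is and errors.As.
func (e retriesExhaustedError) Unwrap() error { return e.err }

// newRetriesExhaustedErr plays the role described for
// experr.NewRetriesExhaustedErr: it tags the terminal error.
func newRetriesExhaustedErr(err error) error {
	return retriesExhaustedError{err: err}
}

// isRetriesExhaustedErr (assumed helper name) reports whether the marker
// appears anywhere in the error chain.
func isRetriesExhaustedErr(err error) bool {
	var marker retriesExhaustedError
	return errors.As(err, &marker)
}

func main() {
	wrapped := fmt.Errorf("export failed: %w", newRetriesExhaustedErr(errDestinationDown))
	fmt.Println("marker found:", isRetriesExhaustedErr(wrapped))
	fmt.Println("cause preserved:", errors.Is(wrapped, errDestinationDown))
	fmt.Println("plain error has marker:", isRetriesExhaustedErr(errDestinationDown))
}
```

Using a wrapper type instead of a string match means callers can still reach the underlying cause, and the marker survives further wrapping with %w.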

Link to tracking issue

codecov · 2025-10-09T15:42:15Z

Codecov Report

❌ Patch coverage is 98.76543% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 92.15%. Comparing base (587a7d2) to head (38bbaec).

Files with missing lines	Patch %	Lines
exporter/exporterhelper/internal/retry_sender.go	90.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #13957      +/-   ##
==========================================
+ Coverage   92.13%   92.15%   +0.01%     
==========================================
  Files         666      666              
  Lines       41438    41515      +77     
==========================================
+ Hits        38180    38257      +77     
  Misses       2218     2218              
  Partials     1040     1040

iblancasa · 2025-10-21T10:47:59Z

@open-telemetry/collector-approvers can you take a look?

jmacd

Non-blocking feedback (cc @jade-guiton-dd @axw).

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

I guess these questions lead me to suspect that it's the queue (not the retry sender) that should count drops, meaning requests that fail with no response returned upstream because wait_for_result=false. Otherwise, failures are failures, and I see no reason to count them in a new way.

iblancasa · 2025-10-28T11:18:15Z

Thanks for your always valuable feedback @jmacd :D

Question 1: The universal telemetry RFC describes the use of an attribute otelcol.component.outcome=failure to indicate when an export fails. Why would we need a separate counter to indicate when retry fails?

The RFC attribute only tells you whether a single export span ended in success or failure. It doesn’t say why it failed or how many items were lost. Before this change, the obsreport sender only knew that err != nil. It could increment otelcol_exporter_send_failed_*, but it couldn’t tell whether the failure was because retries were exhausted, a permanent error was returned on the first attempt, the context was cancelled, the collector shut down, etc.

By having the retry sender wrap the terminal error with experr.NewRetriesExhaustedErr, the obsreport sender can now distinguish “we ran out of retries” from other failure cases. We found this metric valuable in the past because that distinction matters operationally: running out of retries usually points to a long-lived availability problem on the destination side, while other failures (permanent errors, shutdown, context cancellation) have different remediation.
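To illustrate the distinction, here is a rough sketch of classifying a terminal export error by its cause. The sentinel errors, the wrap helper, and failureReason are all hypothetical names for this example; the collector's real markers and the obsreport sender's logic are not reproduced here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the collector's real error markers.
var (
	errRetriesExhausted = errors.New("retries exhausted")
	errPermanent        = errors.New("permanent error")
	errShutdown         = errors.New("collector shutting down")
)

// wrap attaches a marker to a human-readable detail while keeping the
// marker discoverable with errors.Is.
func wrap(marker error, detail string) error {
	return fmt.Errorf("%w: %s", marker, detail)
}

// failureReason maps a terminal export error to a coarse label that could
// drive separate counters or attributes.
func failureReason(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, errRetriesExhausted):
		return "retries_exhausted"
	case errors.Is(err, errPermanent):
		return "permanent"
	case errors.Is(err, context.Canceled), errors.Is(err, context.DeadlineExceeded):
		return "context_done"
	case errors.Is(err, errShutdown):
		return "shutdown"
	default:
		return "other"
	}
}

func main() {
	samples := []error{
		wrap(errRetriesExhausted, "destination unavailable"),
		wrap(errPermanent, "bad request"),
		context.Canceled,
		errShutdown,
		errors.New("unclassified"),
	}
	for _, err := range samples {
		fmt.Printf("%-45v -> %s\n", err, failureReason(err))
	}
}
```

Without a marker on the error, every case above collapses into err != nil, which is all the send-failed counters can express.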

Question 2: If the exporterhelper is configured with wait_for_result=true then it's difficult to call these failures "drops". Wouldn't the same sort of "drop" happen if the queue is configured (without wait_for_result=true) but also without the retry processor?

wait_for_result only controls whether the queue’s Offer call waits for the downstream sender to finish. When it’s true, upstream components see the error immediately; when it’s false, they don’t. In both cases, once the retry sender gives up, the data is gone: the collector has accepted it but cannot deliver it. So it still qualifies as a drop.
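A toy model of that behaviour, under the assumption that Offer either blocks on the sender or hands the request off and returns immediately. The queue type, its Offer signature, and the onDone callback are invented for this sketch and do not match the exporterhelper API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errSendFailed is an illustrative downstream failure.
var errSendFailed = errors.New("send failed after retries")

// queue is a toy stand-in for the exporterhelper queue.
type queue struct {
	waitForResult bool
	wg            sync.WaitGroup
}

// Offer hands a request to send. With waitForResult the producer receives
// the downstream error. Without it, Offer reports success as soon as the
// request is accepted, and the outcome is only visible through onDone.
func (q *queue) Offer(send func() error, onDone func(error)) error {
	if q.waitForResult {
		err := send()
		onDone(err)
		return err
	}
	q.wg.Add(1)
	go func() {
		defer q.wg.Done()
		onDone(send())
	}()
	return nil
}

// Shutdown waits for in-flight asynchronous sends to finish.
func (q *queue) Shutdown() { q.wg.Wait() }

func main() {
	for _, wait := range []bool{true, false} {
		q := &queue{waitForResult: wait}
		var downstream error
		returned := q.Offer(
			func() error { return errSendFailed },
			func(err error) { downstream = err },
		)
		q.Shutdown()
		fmt.Printf("wait_for_result=%v producer saw: %v, downstream result: %v\n",
			wait, returned, downstream)
	}
}
```

In both modes the downstream result is the same failure; only what the producer sees changes.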

The queue already accounts for the situations it is responsible for (otelcol_exporter_enqueue_failed_* covers queue-capacity drops). What it cannot know is why the downstream sender failed. It simply forwards the error it gets back.

In the configuration you mentioned (queue enabled, wait_for_result=false, retry disabled), the queue returns success to the producer, the exporter fails, and obsReportSender.endOp increments otelcol_exporter_send_failed_*. No retry ever ran, so the new retry-drop counter remains zero. That’s intentional: the terminal failure was due to a permanent error, not because a retry budget was exhausted. Conversely, when retries are enabled and eventually fail, the retry sender wraps the error, and the obsreport sender increments both otelcol_exporter_send_failed_* and the new otelcol_exporter_retry_dropped_* counter. Upstream may or may not have seen the error depending on wait_for_result, but the counter captures the fact that “we tried retrying and still had to drop these items.”
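The two configurations can be compared with a small simulation. This is a simplified sketch, not the retry sender's code: the counters struct and export function are hypothetical, and retries are modelled as a fixed attempt count, whereas the real retry logic is driven by backoff and elapsed-time settings and does not retry permanent errors.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// errRetriesExhausted is a sentinel stand-in for the exhausted-retry marker.
	errRetriesExhausted = errors.New("retries exhausted")
	// errUnavailable is an illustrative retryable failure.
	errUnavailable = errors.New("destination unavailable")
)

// counters models the send_failed and retry_dropped item counters.
type counters struct {
	sendFailed   int
	retryDropped int
}

func alwaysFail() error { return errUnavailable }

// export models one request carrying `items` items. maxAttempts is an
// assumption of this sketch, used in place of a time-based retry budget.
func export(c *counters, items, maxAttempts int, retryEnabled bool, attempt func() error) error {
	err := attempt()
	if retryEnabled {
		for i := 1; err != nil && i < maxAttempts; i++ {
			err = attempt()
		}
		if err != nil {
			// Only the retry path tags the terminal error.
			err = fmt.Errorf("%w: %v", errRetriesExhausted, err)
		}
	}
	if err != nil {
		c.sendFailed += items
		if errors.Is(err, errRetriesExhausted) {
			c.retryDropped += items
		}
	}
	return err
}

func main() {
	var noRetry, withRetry counters
	_ = export(&noRetry, 10, 3, false, alwaysFail)
	_ = export(&withRetry, 10, 3, true, alwaysFail)
	fmt.Printf("retry disabled: %+v\n", noRetry)
	fmt.Printf("retry enabled:  %+v\n", withRetry)
}
```

With retry disabled only sendFailed moves; with retry enabled and every attempt failing, both counters record the same items.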

So the queue doesn’t have enough context to produce a “retry exhausted” metric, while the retry sender does. That’s why the new counters live alongside the retry logic instead of inside the queue.

jade-guiton-dd · 2025-10-28T11:54:49Z

(For the record, the type of failure that occurred is already visible in logs. Of course, that doesn't mean we can't also surface it as metrics.)

iblancasa requested review from a team, bogdandrutu and dmitryax as code owners October 9, 2025 14:46

iblancasa force-pushed the 13956 branch 3 times, most recently from 869d3f8 to ddfd2b6 Compare October 21, 2025 10:47

iblancasa force-pushed the 13956 branch from ddfd2b6 to 923abe0 Compare October 21, 2025 10:58

jmacd reviewed Oct 27, 2025

iblancasa force-pushed the 13956 branch 3 times, most recently from 3ca8666 to fbd3281 Compare November 10, 2025 14:33

iblancasa requested review from andrzej-stencel, dmathieu, evan-bradley and mx-psi as code owners November 10, 2025 14:33

iblancasa force-pushed the 13956 branch from fbd3281 to 5f01b45 Compare November 11, 2025 16:42

Add retry dropped item metrics and an exhausted retry error marker for exporter helper retries

9def8ed

Signed-off-by: Israel Blancas <[email protected]>

iblancasa force-pushed the 13956 branch from 5f01b45 to 9def8ed Compare November 11, 2025 16:52

iblancasa added 2 commits November 12, 2025 19:46

Merge branch 'main' into 13956

d3d88be

Merge branch 'main' into 13956

735ee0e

jaysoncena added a commit to jaysoncena/opentelemetry-collector that referenced this pull request Nov 19, 2025

OBS-6820: add otelcol_exporter_retry_dropped_*, based from open-telemetry#13957

d130cd9

Signed-off-by: Jayson Cena <[email protected]>

Merge branch 'main' into 13956

38bbaec

iblancasa requested a review from jmacd November 19, 2025 19:25
