@braydonk braydonk commented Nov 12, 2025

Description

This PR adds a new partial error type for scenarios where a destination reports that some number of items failed but gives the consumer no way to determine which ones.

The error can be treated the same as a permanent error. The only difference from a typical permanent error scenario is that consumers can now report accurate item counts to a statistics tracker such as exporterhelper.obsReportSender.
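To make the counting concrete, here is a minimal, self-contained sketch of how an upstream component might derive sent/failed counts from such an error. It does not use the collector's packages: `itemCountError`, `tally`, and `errBackend` are hypothetical names invented for illustration, and the real API in this PR may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// errBackend is a hypothetical error returned by a destination.
var errBackend = errors.New("backend rejected some items")

// itemCountError is a hypothetical stand-in for an error that knows how
// many items failed, but not which ones.
type itemCountError struct {
	err    error
	failed int
}

func (e *itemCountError) Error() string {
	return fmt.Sprintf("%d items failed: %v", e.failed, e.err)
}

func (e *itemCountError) Unwrap() error { return e.err }

// tally returns how many of total items should be recorded as sent and
// as failed, given the error returned by the consumer.
func tally(total int, err error) (sent, failed int) {
	if err == nil {
		return total, 0
	}
	var ce *itemCountError
	if errors.As(err, &ce) && ce.failed >= 0 && ce.failed <= total {
		return total - ce.failed, ce.failed
	}
	// No usable count: treat the whole request as failed, as with an
	// ordinary permanent error.
	return 0, total
}

func main() {
	partial := fmt.Errorf("export: %w", &itemCountError{err: errBackend, failed: 3})

	sent, failed := tally(10, partial)
	fmt.Println(sent, failed) // 7 3

	sent, failed = tally(10, errBackend)
	fmt.Println(sent, failed) // 0 10
}
```

Without the count, the only safe choice is to record every item in the request as failed, which is the behavior this error type is meant to improve on.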

Link to tracking issue

Part of #13423
Re-opened from #13927

Testing

I have a branch off this one that shows how the exporterhelper package would leverage this error type to extract the proper failed item counts for metrics. https://github.com/braydonk/opentelemetry-collector/compare/partial_error...braydonk:opentelemetry-collector:count_partial_errors?expand=1

Documentation

Documentation is primarily in the godoc of the public API. I'm not sure whether there are better places to include this information as well.

@braydonk braydonk requested a review from a team as a code owner November 12, 2025 14:52
@braydonk braydonk requested a review from atoulme November 12, 2025 14:52
This PR adds functionality for consumers to create a partial error type.
This will allow consumers to properly report partial success/failure with
failed item counts, which can subsequently be used when reporting
sent/failed metrics.
braydonk commented Nov 12, 2025

Replicating @jmacd's comment on the original PR so the discussion can take place here instead.


@jmacd said:

I would like to see a draft, at least, for the documentation that will accompany the package, what the consumers need to know. For example, in the fanout consumer, existing logic returns immediately when one of the fanned-out consumers returns a non-nil error. I believe we need a blanket recommendation that these errors may be treated the same as success in cases like these.

However even while continuing the fan-out, the returned error should join all the partial successes with good information that an OTLP receiver can convey to SDKs and user consoles. Here are some general rules that might make sense to apply:

  • When applying the same request multiple times: concatenate partial-success message strings, take the maximum number of rejected items
  • When combining requests as in a batch processor, prepend "batch rejected %d items: " to the partial-success message, set number rejected to 0
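The two rules above could be sketched roughly as follows. This is only an illustration of the proposal, not code from the collector: the `partial` struct, `mergeFanout`, and `wrapBatch` are hypothetical names, and the `"; "` separator used when concatenating messages is an assumption, since the comment does not specify one.

```go
package main

import (
	"fmt"
	"strings"
)

// partial is a hypothetical record of a partial success: a message and
// the number of rejected items.
type partial struct {
	message  string
	rejected int
}

// mergeFanout applies the first rule: when the same request was sent to
// several consumers, concatenate the messages and keep the maximum
// rejected count.
func mergeFanout(results []partial) partial {
	msgs := make([]string, 0, len(results))
	maxRejected := 0
	for _, r := range results {
		if r.message != "" {
			msgs = append(msgs, r.message)
		}
		if r.rejected > maxRejected {
			maxRejected = r.rejected
		}
	}
	return partial{message: strings.Join(msgs, "; "), rejected: maxRejected}
}

// wrapBatch applies the second rule: when requests were combined into a
// batch, move the count into the message and reset the number to 0.
func wrapBatch(p partial) partial {
	return partial{
		message:  fmt.Sprintf("batch rejected %d items: %s", p.rejected, p.message),
		rejected: 0,
	}
}

func main() {
	merged := mergeFanout([]partial{
		{message: "exporter A: quota exceeded", rejected: 2},
		{message: "exporter B: invalid attribute", rejected: 5},
	})
	fmt.Printf("%q %d\n", merged.message, merged.rejected)

	batched := wrapBatch(merged)
	fmt.Printf("%q %d\n", batched.message, batched.rejected)
}
```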

I've tried to extend the documentation of the public API as much as I can to address the goals of this PR. This error in particular is to allow a consumer to express partial success as a permanent error. In consumererror there are what are nominally referred to as "signal errors", which allow a consumer to express a retryable partial error where the exact failed items are known. This new error is for the scenario where a consumer knows how many items failed, but not which ones.

The semantics of how it should be treated are kind of up to whoever is upstream of the consumer receiving the error. Most may just treat it as a permanent error, with the option of extracting the count of failed items if that's of any interest.

In the case of the fanoutconsumer, I don't think it's necessarily the concern of this package exactly how the error is handled by any particular upstream component. If the consumers that the fanoutconsumer fans out to produce this error, and it opts to proceed differently when it detects this error, I don't see that as being specifically of concern to the consumer producing the error. I also don't see it as necessarily a concern of this package's documentation. This is my first contribution to this package so I'm not sure if I'm totally off-base here, but the way I'm considering this PR is that it gives a consumer a new wrapped way to express something that can't be expressed today, in a way that leverages the existing "permanent error" semantics.

The linter did not like exporting a private type, which is a fair point.
I want to keep the error type private so that the only way to produce a
partial error also necessitates that it's permanent, so I changed the
API to `IsPartial` which produces a count and a boolean.
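The design described in this commit message can be sketched as below. This is a standalone approximation, not the PR's code: `permanentError`, `partialError`, `NewPartial`, and `IsPermanent` are stand-in names assumed for illustration, and only the shape of `IsPartial` (a count and a boolean) comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError is a stand-in for an existing "permanent error" wrapper.
type permanentError struct{ err error }

func (p permanentError) Error() string { return "permanent error: " + p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// partialError stays unexported so that callers cannot construct a
// partial error that is not also permanent.
type partialError struct {
	err    error
	failed int
}

func (p partialError) Error() string {
	return fmt.Sprintf("%d items failed: %v", p.failed, p.err)
}
func (p partialError) Unwrap() error { return p.err }

// NewPartial is the only way to create a partial error in this sketch,
// and it always wraps the result as permanent.
func NewPartial(err error, failed int) error {
	return permanentError{err: partialError{err: err, failed: failed}}
}

// IsPermanent reports whether err's chain contains a permanentError.
func IsPermanent(err error) bool {
	var p permanentError
	return errors.As(err, &p)
}

// IsPartial returns the failed item count and true if err's chain
// contains a partialError, without exposing the type itself.
func IsPartial(err error) (int, bool) {
	var p partialError
	if errors.As(err, &p) {
		return p.failed, true
	}
	return 0, false
}

func main() {
	err := fmt.Errorf("export failed: %w", NewPartial(errors.New("backend rejected items"), 3))
	count, ok := IsPartial(err)
	fmt.Println(IsPermanent(err), count, ok) // true 3 true
}
```

Returning a count and a boolean instead of the error value lets the package keep the type private while still giving callers everything they need for metrics.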
codecov bot commented Nov 12, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.25%. Comparing base (5c13c75) to head (4edad6b).
⚠️ Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14152      +/-   ##
==========================================
+ Coverage   92.24%   92.25%   +0.01%     
==========================================
  Files         658      659       +1     
  Lines       41171    41199      +28     
==========================================
+ Hits        37978    38008      +30     
+ Misses       2185     2184       -1     
+ Partials     1008     1007       -1     
