[exporter/otlp] Report runtime status #11366

mwear · 2024-10-04T22:07:34Z

Description

This PR adds runtime status reporting for the otlp and otlphttp exporters. It is an updated version of #8788, one of several approaches explored for adding this functionality. That work was paused to let the consumererror work evolve, since it appeared there might be a way to apply the same logic uniformly to all exporters (via the exporterhelper). The consumererror work has since taken a different direction, and a uniform approach no longer looks possible.

This PR implements runtime status reporting as discussed in #9957. The choices for which statuses represent permanent errors are up for debate, but the key point is that a permanent error is an error discovered at runtime that will require user intervention to fix.

This implementation makes use of the finite state machine that underlies the status reporting system. The finite state machine will ensure that:

Only changes in status are reported. Repeat reports of the same status will no-op.
If a component transitions into a PermanentError all further status reports will no-op.
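The two guarantees above can be illustrated with a minimal sketch. This is not the collector's actual state machine: the `fsm` type, its fields, and the reduced `Status` set are hypothetical stand-ins, and the real implementation also tracks starting and stopping states that are omitted here.

```go
package main

import "fmt"

// Status is a simplified stand-in for the collector's component statuses.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// fsm is a hypothetical, minimal state machine that forwards a status
// to onChange only when it represents a real transition.
type fsm struct {
	current  Status
	started  bool
	onChange func(Status)
}

// Report returns true when the status was emitted and false when it was a no-op.
func (m *fsm) Report(s Status) bool {
	// Once in PermanentError, every further report is ignored.
	if m.started && m.current == StatusPermanentError {
		return false
	}
	// Repeat reports of the same status are ignored.
	if m.started && m.current == s {
		return false
	}
	m.current = s
	m.started = true
	m.onChange(s)
	return true
}

func main() {
	var emitted []Status
	m := &fsm{onChange: func(s Status) { emitted = append(emitted, s) }}

	reports := []Status{
		StatusOK, StatusOK, // second OK is a repeat
		StatusRecoverableError, StatusRecoverableError, // second error is a repeat
		StatusOK,
		StatusPermanentError,
		StatusOK, // ignored: PermanentError is terminal
	}
	for _, s := range reports {
		m.Report(s)
	}
	fmt.Println(len(emitted)) // prints 4
}
```

Seven reports produce four events, so a caller can report on every request without tracking what it reported previously.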

This means the exporter does not have to reason about current or previous statuses. It can report status based on its current view of the world and the status reporting system will handle the rest. Flapping between recoverable and ok is meant to be handled by watchers consuming status events. The healthcheckv2extension handles this by using a time based approach (e.g. recovery interval). Other watchers may choose to handle this situation differently.

For more information on status reporting, the state machine, etc., see: https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md

While all components report status during start and shutdown (via automation), this is the first component to report runtime status. This will allow the healthcheckv2extension to replace the currently non-functioning check_collector_pipeline capability of the original healthcheckextension and should serve as an example for other components that wish to report runtime status moving forward.

Link to tracking issue

Implements #9957 for the otlp and otlphttp exporters

Testing

units/manual

Documentation

code comments

codecov · 2024-10-04T22:15:17Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.64%. Comparing base (f74890a) to head (2043eba).

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #11366      +/-   ##
==========================================
+ Coverage   91.61%   91.64%   +0.03%     
==========================================
  Files         443      443              
  Lines       23770    23828      +58     
==========================================
+ Hits        21776    21837      +61     
+ Misses       1620     1618       -2     
+ Partials      374      373       -1

bogdandrutu

Before implementation, can we document somewhere (ideally in an RFC, or in https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md) when an exporter is considered to be in a permanent error state? I feel like we have different understandings of this, and I don't want you to make lots of code changes until we agree on the basic definition.

exporter/otlpexporter/otlp.go

mwear · 2024-10-18T16:35:04Z

Before implementation, can we document somewhere (ideally in an RFC, or in https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/component-status.md) when an exporter is considered to be in a permanent error state?

I have these documented here: #9957. Is it ok to discuss on that issue, or would you prefer it to be an RFC?

mwear · 2024-10-18T16:48:37Z

If you're suggesting that the errors identified as Permanent in this PR (and #9957) are in many cases recoverable, we can identify all errors as recoverable and skip permanent for now.

bogdandrutu · 2024-10-18T16:50:34Z

I am not the biggest expert, but I feel at least some are recoverable, and some are contextual (for example, auth when passing auth tokens from the client).

If you're suggesting that the errors identified as Permanent in this PR (and #9957) are in many cases recoverable, we can identify all errors as recoverable and skip permanent for now.

I think that would be safer, correct?

mwear · 2024-10-18T17:57:52Z

I think going with recoverable for all errors, at least for now, makes sense. We can discuss what errors should be permanent on #9957, but I suspect we'll be able to come up with counter arguments for most of the cases.

mwear · 2024-10-18T22:24:49Z

I updated this PR to handle all errors as recoverable. If we are happy with this approach, we could consider making this an option for all exporters via the exporter helper, similar to #8684, likely as a follow-up.

exporter/otlpexporter/otlp.go

bogdandrutu · 2024-10-22T00:07:46Z

exporter/otlpexporter/otlp.go

+		componentstatus.ReportStatus(e.host, componentstatus.NewRecoverableErrorEvent(err))
+		return
+	}
+	componentstatus.ReportStatus(e.host, componentstatus.NewEvent(componentstatus.StatusOK))


Are we not concerned about reporting the status on every request? Should we do it periodically, or only when it changes?

mwear · 2024-10-23

The finite state machine handles this for us. It will report status only if there has been a change; otherwise it will no-op.

github-actions · 2024-11-12T03:16:34Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

mwear · 2024-11-23T00:10:17Z

Would anyone be interested in reviewing this, or at least removing the stale label?

github-actions · 2024-12-07T03:22:50Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

atoulme · 2024-12-11T05:48:03Z

exporter/otlphttpexporter/otlp.go

+		componentstatus.ReportStatus(e.host, componentstatus.NewRecoverableErrorEvent(err))
+		return err
+	}
+	componentstatus.ReportStatus(e.host, componentstatus.NewEvent(componentstatus.StatusOK))


This might create a lot of events reporting OK. Is that really what you want? Is it there to recover from the error status?

github-actions · 2024-12-26T03:17:00Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2025-01-09T03:26:30Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

mwear requested a review from a team as a code owner October 4, 2024 22:07

mwear requested a review from codeboten October 4, 2024 22:07

mwear force-pushed the exp-status branch 3 times, most recently from 646d3dc to 84944b7 Compare October 4, 2024 22:45

codeboten added the exporter/otlp label Oct 9, 2024

bogdandrutu reviewed Oct 13, 2024

mwear force-pushed the exp-status branch 2 times, most recently from ba773f0 to 66dbf98 Compare October 18, 2024 22:24

bogdandrutu reviewed Oct 22, 2024

mwear force-pushed the exp-status branch 2 times, most recently from 94fd12c to 1a6e939 Compare October 24, 2024 00:15

github-actions bot added the Stale label Nov 12, 2024

mwear and others added 7 commits November 22, 2024 15:39

Report runtime status from otlpexporter

4d8e3a6

Report runtime status from otlphttpexporter

4a1fc30

Add changelog

5e2ba13

Lint

ed5842a

Handle all errors as recoverable

763363a

Use wrapper function instead of defer

0942592

Coverage

6a2376a

mwear force-pushed the exp-status branch from 1a6e939 to 6a2376a Compare November 22, 2024 23:42

mwear added 2 commits November 22, 2024 15:54

go mod tidy

2043eba

Lint

4557415

github-actions bot removed the Stale label Nov 23, 2024

github-actions bot added the Stale label Dec 7, 2024

atoulme removed the Stale label Dec 11, 2024

atoulme reviewed Dec 11, 2024

github-actions bot added the Stale label Dec 26, 2024

github-actions bot closed this Jan 9, 2025

mwear mentioned this pull request May 12, 2025

exporter eventually failing on retryable errors not reported in healthcheck v2 endpoint #13013

Open
