[azure-eventhub] Update input v1 status on start, failure, and stop #41469

Merged (3 commits) on Nov 7, 2024

@zmoog zmoog commented Oct 28, 2024

Proposed commit message

Update the Elastic Agent status by calling inputContext.UpdateStatus(status.Failed, err.Error()) during the main input lifecycle phases (setup and run). If any of the setup, startup, or run steps fails, the input reports the fatal error to the agent before shutting down.

Previously, the input logged the fatal error and stopped without reporting it, so users continued to see the input as "healthy" in Fleet, which caused confusion and made troubleshooting much harder.
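
A minimal, self-contained sketch of the pattern described above. It does not use the Beats packages: `Status`, `StatusReporter`, `recordingReporter`, `input`, `failWith`, and `succeed` are hypothetical stand-ins, and only the `UpdateStatus(Failed, err.Error())` call mirrors the one quoted in this PR. The `Starting`, `Running`, and `Stopped` values are assumptions based on the PR title ("start, failure, and stop").

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical stand-in for the agent status values.
// Only Failed is taken from the PR text; the others are assumptions.
type Status string

const (
	Starting Status = "STARTING"
	Running  Status = "RUNNING"
	Failed   Status = "FAILED"
	Stopped  Status = "STOPPED"
)

// StatusReporter is a hypothetical interface modelled on the
// inputContext.UpdateStatus(status, message) call quoted in the PR.
type StatusReporter interface {
	UpdateStatus(s Status, msg string)
}

// recordingReporter keeps every update so the example can inspect them.
type recordingReporter struct {
	updates []string
}

func (r *recordingReporter) UpdateStatus(s Status, msg string) {
	r.updates = append(r.updates, fmt.Sprintf("%s: %s", s, msg))
}

// input is a toy input whose setup and run steps are injected.
type input struct {
	setup func() error
	run   func() error
}

// Run reports a status at each lifecycle transition and reports Failed
// before returning a fatal error from either step.
func (in *input) Run(reporter StatusReporter) error {
	reporter.UpdateStatus(Starting, "setting up input")
	if err := in.setup(); err != nil {
		reporter.UpdateStatus(Failed, err.Error())
		return err
	}
	reporter.UpdateStatus(Running, "input is running")
	if err := in.run(); err != nil {
		reporter.UpdateStatus(Failed, err.Error())
		return err
	}
	reporter.UpdateStatus(Stopped, "input stopped")
	return nil
}

// failWith returns a step that always fails with the given message.
func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// succeed is a step that never fails.
func succeed() error { return nil }

func main() {
	rep := &recordingReporter{}
	in := &input{setup: succeed, run: failWith("invalid storage account key")}
	fmt.Println("run returned:", in.Run(rep))
	for _, u := range rep.updates {
		fmt.Println(u)
	}
}
```

In this sketch a failure in the run step leaves three recorded updates (starting, running, failed), so the last status the agent would see is the failure rather than a stale "running" state.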

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Test with wrong event hub credentials (for example, use a connection string from a different namespace)
  • Test with wrong storage account credentials (for example, a key from a different storage account or a random string)

How to test this PR locally

  • Build a custom agent
  • Install the Azure Logs integration
    • set an invalid connection string to test setup() failures
    • set an invalid storage account key to test run() failures

Related issues

Screenshots

Fatal error during setup() caused by an invalid event hub connection string:

[Screenshot: CleanShot 2024-11-06 at 22 58 15@2x]

Fatal error during run() caused by an invalid storage account key:

[Screenshot: CleanShot 2024-11-06 at 19 37 33@2x]

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 28, 2024
@mergify mergify bot assigned zmoog Oct 28, 2024

mergify bot commented Oct 28, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @zmoog? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit


mergify bot commented Oct 28, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 28, 2024
@@ -105,9 +109,11 @@ func (in *eventHubInputV1) Run(
	err = in.run(ctx)
	if err != nil {
		in.log.Errorw("error running input", "error", err)
		inputContext.UpdateStatus(status.Failed, err.Error())

Inline review comment by @zmoog (PR author):

Hey @faec, I missed the opportunity to report the input status to the Elastic Agent when I updated the input to the Input API v2.

I am trying to address scenarios (like elastic/integrations#9659) where the input starts, and after a while, it encounters a fatal error, and the SDK worker shuts down.

What are your recommendations for reporting input status to the agent? Which input should I use as a reference?

🙇

@zmoog zmoog added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Oct 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 28, 2024
@zmoog zmoog added needs_team Indicates that the issue/PR needs a Team:* label bugfix labels Oct 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 28, 2024
@zmoog zmoog added needs_team Indicates that the issue/PR needs a Team:* label input:azure-eventhub labels Oct 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 28, 2024
@zmoog zmoog force-pushed the zmoog/azure-eventhub-input-status branch from f112711 to 9412dd2 Compare November 6, 2024 10:29
When I updated the input to the Input API v2, I missed the opportunity
to report the input status back to the Elastic Agent.
Now also cover the following phases:

- pipeline creation
- sanitizers creation
- input setup
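
One possible way to cover several phases like the ones listed above is to run them as an ordered list and report the first failure together with the phase name. This is only a sketch under stated assumptions, not the code from this PR: `phase`, `reportFunc`, `runPhases`, and `errExample` are hypothetical names, and the status is a plain string rather than a value from the Beats status package.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is a hypothetical pairing of a phase name and the step to run.
type phase struct {
	name string
	step func() error
}

// reportFunc is a hypothetical stand-in for inputContext.UpdateStatus;
// the status is a plain string in this sketch.
type reportFunc func(status, msg string)

// errExample is a made-up error used only by the demo in main.
var errExample = errors.New("example failure")

// runPhases executes the phases in order. On the first error it reports
// "FAILED" with a message that names the phase, returns the wrapped
// error, and skips the remaining phases.
func runPhases(report reportFunc, phases []phase) error {
	for _, p := range phases {
		if err := p.step(); err != nil {
			wrapped := fmt.Errorf("%s: %w", p.name, err)
			report("FAILED", wrapped.Error())
			return wrapped
		}
	}
	return nil
}

func main() {
	report := func(status, msg string) {
		fmt.Println("status update:", status, msg)
	}
	err := runPhases(report, []phase{
		{name: "pipeline creation", step: func() error { return nil }},
		{name: "sanitizers creation", step: func() error { return errExample }},
		{name: "input setup", step: func() error { return nil }},
	})
	fmt.Println("returned error:", err)
}
```

Wrapping the error with the phase name keeps the message shown to the user specific about which step failed, while `errors.Is` still matches the underlying error.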
@zmoog zmoog force-pushed the zmoog/azure-eventhub-input-status branch from 9412dd2 to c631779 Compare November 6, 2024 22:23
@zmoog zmoog marked this pull request as ready for review November 6, 2024 22:24
@zmoog zmoog requested a review from a team as a code owner November 6, 2024 22:24
@elasticmachine

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

zmoog commented Nov 6, 2024

Tested on a custom build of the Elastic Agent. We set an invalid event hub connection string and an invalid storage account key to trigger fatal errors during the input setup and run phases.

@zmoog zmoog added backport-8.15 Automated backport to the 8.15 branch with mergify backport-8.16 Automated backport with mergify labels Nov 6, 2024
@zmoog zmoog enabled auto-merge (squash) November 6, 2024 23:44
@zmoog zmoog merged commit 882c854 into main Nov 7, 2024
22 checks passed
@zmoog zmoog deleted the zmoog/azure-eventhub-input-status branch November 7, 2024 00:36
mergify bot pushed a commit that referenced this pull request Nov 7, 2024
…41469)

Update the Elastic Agent status by calling `inputContext.UpdateStatus(status.Failed, err.Error())` during the main input lifecycle phases (set up and run). If any setup, startup, and run steps fail, the input reports the fatal issue before shutting down.

Without reporting the fatal error, the input logs the error and stops, but users continue to see it as "healthy" in Fleet, causing confusion and making troubleshooting much harder.

(cherry picked from commit 882c854)
zmoog added a commit that referenced this pull request Nov 7, 2024
…41469) (#41547)

Update the Elastic Agent status by calling `inputContext.UpdateStatus(status.Failed, err.Error())` during the main input lifecycle phases (set up and run). If any setup, startup, and run steps fail, the input reports the fatal issue before shutting down.

Without reporting the fatal error, the input logs the error and stops, but users continue to see it as "healthy" in Fleet, causing confusion and making troubleshooting much harder.

(cherry picked from commit 882c854)

Co-authored-by: Maurizio Branca <[email protected]>
zmoog added a commit that referenced this pull request Nov 7, 2024
…41469) (#41546)

Update the Elastic Agent status by calling `inputContext.UpdateStatus(status.Failed, err.Error())` during the main input lifecycle phases (set up and run). If any setup, startup, and run steps fail, the input reports the fatal issue before shutting down.

Without reporting the fatal error, the input logs the error and stops, but users continue to see it as "healthy" in Fleet, causing confusion and making troubleshooting much harder.

(cherry picked from commit 882c854)

Co-authored-by: Maurizio Branca <[email protected]>
zmoog added a commit that referenced this pull request Nov 7, 2024
…art, failure, and stop (#41545)

* [azure-eventhub] Update input v1 status on start, failure, and stop (#41469)

Update the Elastic Agent status by calling `inputContext.UpdateStatus(status.Failed, err.Error())` during the main input lifecycle phases (set up and run). If any setup, startup, and run steps fail, the input reports the fatal issue before shutting down.

Without reporting the fatal error, the input logs the error and stops, but users continue to see it as "healthy" in Fleet, causing confusion and making troubleshooting much harder.

(cherry picked from commit 882c854)

* Drop unintentional changelog entries

---------

Co-authored-by: Maurizio Branca <[email protected]>
Labels
  • backport-8.x: Automated backport to the 8.x branch with mergify
  • backport-8.15: Automated backport to the 8.15 branch with mergify
  • backport-8.16: Automated backport with mergify
  • bugfix
  • input:azure-eventhub
  • Team:obs-ds-hosted-services: Label for the Observability Hosted Services team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[azure-eventhub] Agent Reports Healthy Status After Input Error Post-Startup
4 participants