Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't fail checkin if agent has upgrade details with no action #3991

Merged
merged 5 commits into from
Oct 23, 2024

Conversation

jillguyonnet
Copy link
Contributor

@jillguyonnet jillguyonnet commented Oct 10, 2024

What is the problem this PR solves?

Relates https://github.com/elastic/ingest-dev/issues/4223

We're investigating an issue in a serverless project where Fleet Server keeps throwing 500 on agent checkin because the agent has upgrade details but the action id they refer to is not found.

How does this PR solve the problem?

Modify Checkin.processUpgradeDetails so that checkin does not fail in this scenario and log a warning instead.

@jillguyonnet jillguyonnet added the Team:Fleet Label for the Fleet team label Oct 10, 2024
@jillguyonnet jillguyonnet self-assigned this Oct 10, 2024
@jillguyonnet jillguyonnet requested a review from a team as a code owner October 10, 2024 14:22
Copy link
Contributor

mergify bot commented Oct 10, 2024

This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Oct 10, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 10, 2024
@jlind23
Copy link
Contributor

jlind23 commented Oct 10, 2024

After chatting about this with @cmacknz we would like to change this behaviour and not fail nor even return a 5xx whenever this is happening. At most, a warning should be logged but that would be it.

@jillguyonnet
Copy link
Contributor Author

Thanks @jlind23 - I'm not sure how to test this properly yet but I pushed a commit as it seems like a minimal change.

@jlind23
Copy link
Contributor

jlind23 commented Oct 10, 2024

I would like at least to get @michel-laterman's eyes on this before merging.

Copy link
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a changelog entry for this.

Also we can add an e2e test, we can enrol the agent then checkin with upgrade_details where we specify the action id

@cmacknz
Copy link
Member

cmacknz commented Oct 10, 2024

After chatting about this with @cmacknz we would like to change this behaviour and not fail nor even return a 5xx whenever this is happening. At most, a warning should be logged but that would be it.

To add some context for why I suggested this, if you run elastic-agent upgrade for a Fleet managed agent there will be no action, and then all sub-sequent check in will fail with 500s.

We need the system to keep operating normally if this happens, this should not be a way to block agent from successfully checking in.

Taking a larger step back, if an agent says it is updating, and Fleet doesn't expect that, the agent should probably be flagged with an "unexpected update" status or something in the Fleet UI. Most users are not going to look in the Fleet Server logs to see things like this.

We could also arguably prevent using elastic-agent upgrade at all for Fleet managed upgrades, but the system would still have to tolerate this "surprise upgrade details" situation well.

@jillguyonnet
Copy link
Contributor Author

We should add a changelog entry for this.

Also we can add an e2e test, we can enrol the agent then checkin with upgrade_details where we specify the action id

Thanks @michel-laterman - I've added a changelog entry. I've tried adding an e2e test but running into issues so far. I'll reach out to you about this.

@jillguyonnet
Copy link
Contributor Author

Thank you @cmacknz for the added context, it makes definite sense that a non existent agent action should not cause the checkin to fail.

On the serverless project where the issue is happening, the agentless agent version was updated as part of a new k8s deployment running a newer version of the agent. IIUC, because this is not a Fleet-managed upgrade process, it could have resulted in agents with upgrade details but no agent actions.

I'm waiting to check the proposed fix implementation with @michel-laterman.

It sounds like it would be worth capturing your last observation (handling non Fleet managed upgrades) somewhere else, perhaps in an ingest-dev issue?

@swiatekm swiatekm removed their request for review October 14, 2024 15:38
@jillguyonnet jillguyonnet changed the title Improve error message on Checkin.processUpgradeDetails if action not … Don't fail checkin if agent has upgrade details with no action Oct 15, 2024
@cmacknz
Copy link
Member

cmacknz commented Oct 15, 2024

On the serverless project where the issue is happening, the agentless agent version was updated as part of a new k8s deployment running a newer version of the agent. IIUC, because this is not a Fleet-managed upgrade process, it could have resulted in agents with upgrade details but no agent actions.

Containers don't trigger the upgrade process, and you can't send upgrade actions to agents in containers because they report themselves (or should) as non-upgradable. You can exec into a container and run elastic-agent upgrade though which may be where this came from.

I am a little bit hesitant to start forbidding use of the elastic-agent upgrade completely for Fleet managed agents as it is a useful escape hatch if something goes terribly wrong. For now I've written elastic/kibana#196426 as an enhancement on the Fleet side to flag agents that are upgrading unexpectedly.

@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Oct 21, 2024
@jillguyonnet
Copy link
Contributor Author

I'm using the CI to run my test as e2e tests for the time being are they are not running properly locally, this might take a few commits.

@jillguyonnet jillguyonnet marked this pull request as draft October 22, 2024 10:07
@jillguyonnet jillguyonnet marked this pull request as ready for review October 22, 2024 13:46
Copy link
Contributor

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

thanks for adding the e2e test!

@jillguyonnet jillguyonnet merged commit 9dc43b0 into elastic:main Oct 23, 2024
8 checks passed
@jillguyonnet jillguyonnet deleted the improve-checkin-error-message branch October 23, 2024 08:38
mergify bot pushed a commit that referenced this pull request Oct 23, 2024
Don't fail checkin if agent has upgrade details with no action

(cherry picked from commit 9dc43b0)
@pierrehilbert
Copy link
Contributor

Should we have this fix also in 8.16?
cc @jillguyonnet

@jillguyonnet
Copy link
Contributor Author

@pierrehilbert Yes, it would make sense. Does the backport-8.x label not cover it?

@pierrehilbert
Copy link
Contributor

No, we already created the 8.16 branch, I'm adding the backport to 8.16, thanks for confirming :-)

@pierrehilbert pierrehilbert added the backport-8.16 Automated backport with mergify label Oct 23, 2024
mergify bot pushed a commit that referenced this pull request Oct 23, 2024
Don't fail checkin if agent has upgrade details with no action

(cherry picked from commit 9dc43b0)
pierrehilbert pushed a commit that referenced this pull request Oct 23, 2024
#4035)

Don't fail checkin if agent has upgrade details with no action

(cherry picked from commit 9dc43b0)

Co-authored-by: Jill Guyonnet <[email protected]>
ycombinator pushed a commit that referenced this pull request Oct 30, 2024
Don't fail checkin if agent has upgrade details with no action

(cherry picked from commit 9dc43b0)
jillguyonnet added a commit that referenced this pull request Nov 4, 2024
#4034)

Don't fail checkin if agent has upgrade details with no action

(cherry picked from commit 9dc43b0)

Co-authored-by: Jill Guyonnet <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants