Skip to content

Comments

Eliminate potential race conditions while rebooting#631

Merged
cbgbt merged 4 commits intobottlerocket-os:developfrom
cbgbt:reboot-racer
Jun 26, 2024
Merged

Eliminate potential race conditions while rebooting#631
cbgbt merged 4 commits intobottlerocket-os:developfrom
cbgbt:reboot-racer

Conversation

@cbgbt
Copy link
Contributor

@cbgbt cbgbt commented Jun 25, 2024

Issue number:
Closes #630

Description of changes:
This PR best left in commit-order, as it also performs several maintenance chores.

    agent: eliminate potential races while rebooting
    
    The agent had a few conditions under which it can enter the
    MonitoringUpdate state before it successfully uncordons the node:
    
    * Brupop updates its state into Rebooted before the reboot terminates
      the process
    * After rebooting, an error ocurrs after updating the state but before
      performing the uncordon.
    
    We avoid this conditions by:
    * Exiting the agent process in cases where the desired state is Rebooted
      but we are not yet running the new version.
    * Always uncordoning the node before marking that we have successfully
      transitioned into the Rebooted state.
    * Defensively uncordoning the node once again when we enter the
      Monitoring state.

Testing done:
I built a custom Bottlerocket variant which always waits for a minute before performing a reboot, then used this to test these changes. In doing this, I was able to reliably trigger the race condition, and then also confirm that this new code does not misbehave when reboots occur slowly.

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

cbgbt added 4 commits June 25, 2024 23:19
Several keys have been deprecated in favor of the same default values we
were specifying before. By removing them and adding `version = 2` under
licenses, we will maintain the same behavior but without the deprecation
notices that we being printed.
The agent had a few conditions under which it can enter the
MonitoringUpdate state before it successfully uncordons the node:

* Brupop updates its state into Rebooted before the reboot terminates
  the process
* After rebooting, an error ocurrs after updating the state but before
  performing the uncordon.

We avoid this conditions by:
* Exiting the agent process in cases where the desired state is Rebooted
  but we are not yet running the new version.
* Always uncordoning the node before marking that we have successfully
  transitioned into the Rebooted state.
* Defensively uncordoning the node once again when we enter the
  Monitoring state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

brupop agent (sometimes) leaving nodes unschedulable

3 participants