Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(app): allow robot restarts to track boot ID and timeout #7589

Merged
merged 5 commits into from
Apr 14, 2021

Conversation

mcous
Copy link
Contributor

@mcous mcous commented Apr 5, 2021

Overview

This PR moves restart tracking logic from a smear across the robot-admin and buildroot modules into a single epic in the robot-admin module. The new epic:

  • Keeps existing up/down tracking behavior (watch for robots to go down and come back up)
    • This logic has a known race condition that could result in a restart being "missed"
  • Now also tracks boot ID as an additional "successful restart" avenue
    • Because we're comparing boot IDs, the race condition above does not apply
  • Calls the restart "timed out" after 60 seconds, closing the spinner
    • This will resolve and "missed" restarts and drop the user into a state with a warning modal in case there is some actual problem which caused the robot to not come back

Closes #6585

Changelog

  • feat(app): allow robot restarts to track boot ID and timeout
    • Includes general restarts, like clicking the "Restart" button (full page spinner)
    • Includes restarts initiated during a robot software update (progress animation in update modal)

Review requests

This is kind of a gnarly test matrix, but here's the behaviors we want to test. The "Restart" button on the robot page is the easiest way to trigger a restart. This works best with a non-Wi-Fi robot so you can pull the cable at any point to simulate failures.

  • Successful restart: robot reboots, spinner pops up then goes away
  • Unsuccessful restart: robot reboots, spinner pops up, robot never comes back, spinner goes away after timeout
    • Yanking the Ethernet cable after you click "Restart" can simulate this

We also need to test both of these for robot updates:

  • Successful update: robot reboots, you end up on a "robot has updated" success modal
  • Unsuccessful update: robot reboots, eventually it times out and gives you an error modal
    • To trigger: pull the Ethernet cable after you get to the "your robot is restarting" message

With those behaviors, we have two types of robots to test on:

  • v4.2.x robots and higher - expose a bootId in their update server's health endpoint
    • Inspect Redux State > robotAdmin > ${robotName} > restart > bootId to make sure its being tracked
    • Also look at the payloads of robotAdmin:RESTART_STATUS_CHANGED actions, there should be a bootId in the action that changes the status to restart-pending
  • Pre-4.2 robots do not have a bootId
    • We need to test that the restart spinner fallback logic still works

Finally:

  • The restart timeout of 60 changed to 300 seconds (5 minutes) was an arbitrary choice! Is that too short? Too long?
    • Keep an eye on how long restarts take in your testing, especially during updates

If you need refresher on what the buildroot update epics are supposed to do, check out the app cookbook which has a mostly complete outline + flowcharts of the app-side logic of the update procedure.

Risk assessment

Medium. I have not tested this on a robot yet, and it required some changes to the robot update logic. After robot testing, this PR appears to be doing its job well when combined with #7608

Mitigating factors:

  • Code is well covered by unit tests, though it's very possible I'm missing some edge cases
  • Changes to the robot update logic should only affect code that happens after the update has been sent over, verified, and committed

@mcous mcous added robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). app Affects the `app` project feature Ticket is a feature request / PR introduces a feature labels Apr 5, 2021
@mcous mcous changed the title feat(app): allow general robot restart spinner to track boot ID feat(app): allow robot restarts to track boot ID and timeout Apr 6, 2021
@mcous mcous marked this pull request as ready for review April 6, 2021 00:32
@mcous mcous requested review from a team as code owners April 6, 2021 00:32
@mcous mcous requested review from amitlissack and shlokamin and removed request for a team April 6, 2021 00:32
@mcous mcous requested a review from a team April 6, 2021 18:05
Copy link
Contributor

@amitlissack amitlissack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be looked at by JS person.

app/src/redux/buildroot/epic.js Outdated Show resolved Hide resolved
@mcous mcous added the WIP label Apr 8, 2021
@SyntaxColoring
Copy link
Contributor

SyntaxColoring commented Apr 8, 2021

Calls the restart "timed out" after 60 seconds, closing the spinner

This seems pretty tight to me. Anecdotally, I think I've seen restarts take multiple minutes, depending on whether the motor controller board needs to update its firmware, and potentially even how old the SD card is. Admittedly, though, that's measuring from power-on to API server readiness—not update server readiness.

Should we instead set the timeout very very conservatively? Like 10 minutes?

Edit: And the current "Robot is restarting..." dialog copy says "this may take up to 3 minutes," which still seems kind of tight to me.

@mcous mcous added qa QA input / review required and removed WIP labels Apr 13, 2021
@codecov
Copy link

codecov bot commented Apr 14, 2021

Codecov Report

❗ No coverage uploaded for pull request base (edge@c526b19). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge    #7589   +/-   ##
=======================================
  Coverage        ?   82.31%           
=======================================
  Files           ?      328           
  Lines           ?    22062           
  Branches        ?        0           
=======================================
  Hits            ?    18160           
  Misses          ?     3902           
  Partials        ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c526b19...8918e39. Read the comment docs.

@mcous mcous merged commit 3b33102 into edge Apr 14, 2021
@mcous mcous deleted the app_restart-tracking-epic branch April 14, 2021 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
app Affects the `app` project feature Ticket is a feature request / PR introduces a feature qa QA input / review required robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience).
Projects
None yet
4 participants