fixing startup deadlock #11142

Merged: 4 commits into dev from brettsam/race on Jun 30, 2025

Conversation

@brettsam (Member) commented Jun 20, 2025

Issue describing the changes in this PR

resolves #10766

Pull request checklist

IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.

  • Backporting to the in-proc branch is not required
    • Otherwise: Link to backporting PR
  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • My changes do not require diagnostic events changes
    • Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
  • I have added all required tests (Unit tests, E2E tests)

Additional information

The problem here (outlined in the issue) is that:

  1. Even in transient error situations, we were setting the State to Error.
  2. If we tried to recover, the WorkerFunctionMetadataProvider would look at that state, think that the host was in a stable state without any channels, and issue a restart here:
    await _scriptHostManager.RestartHostAsync();
  3. That restart would deadlock waiting for the current startup to complete... but it was part of the current startup so it'd never complete.
  4. Deadlock (see the sketch after this list).
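
To make the cycle concrete, here is a minimal sketch of the shape of the deadlock. All names are hypothetical and only illustrate the pattern; this is not the actual host code.

    using System.Threading.Tasks;

    // Hypothetical names -- a sketch of the deadlock pattern, not the real host code.
    public class HostManagerSketch
    {
        private readonly TaskCompletionSource _startupCompleted = new();

        public async Task StartHostAsync()
        {
            // Worker channel setup hits a transient error, the state is set to Error,
            // and the metadata provider reacts by requesting a restart from *within* startup:
            await RestartHostAsync();

            _startupCompleted.SetResult(); // never reached
        }

        public async Task RestartHostAsync()
        {
            // The restart waits for the current startup to finish before tearing down...
            // ...but the current startup is our caller, so this await never completes.
            await _startupCompleted.Task;
        }
    }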

After looking through all the code, I've concluded that:

  • Error and Offline (and to some extent Running) are considered "terminal" states... that's the state things will remain in. We have code that immediately returns if it sees Error, etc.
  • However, during a transient error, this isn't the case. We are recovering and need to communicate that somehow.
  • In this specific case, it's untrue that the ScriptHost is in an Error state. We are retrying fresh, so the metadata manager needs to know that.
  • By resetting to Default, it communicates that we're still starting up.
  • Yes, this may go on forever as we'll back off but retry until the platform stops us. However, the places that are waiting on this also have timeouts applied, so they should be unaffected. And in fact, they should behave better under transient conditions.

My original attempt added a HandlingHostError state that, while it did communicate correctly, changed the assumptions across other services. A handful of tests failed.

After seeing tests fail because they assumed Error, I changed the approach to simply start worker channels while in the Error state as well (illustrated below). This seems fine the more I look at it -- we are performing a normal startup (or else these services wouldn't start), so we should behave as such. Every other service behaves normally here; we just happen to have this unique piece of code for an odd situation where we need to recover.
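
Roughly, the change amounts to treating Error, during startup, as a state in which worker channels may still be started. The sketch below is only illustrative -- the enum values and guard are hypothetical stand-ins, not the actual implementation:

    // Hypothetical state enum and guard -- illustrative only, not the actual host code.
    enum HostState { Default, Initialized, Running, Error, Offline }

    static bool ShouldStartWorkerChannels(HostState state) =>
        state is HostState.Default or HostState.Initialized
        // Newly allowed: a transient Error during startup means we're recovering,
        // so worker channels should start just as they would in a normal startup.
        or HostState.Error;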

Other considerations:

  • What we really should do is make sure that when RestartAsync() is called, the CancellationToken we pass flows all the way down, effectively cancelling itself. However, that would be a large refactoring of code that is already being completely rewritten. I consider this a strategic fix to stop the issues we're currently seeing; the next iteration of worker management will take care of this scenario in a cleaner way.
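
For reference, "flowing the CancellationToken all the way down" would look roughly like the sketch below -- hypothetical names, and only the skeleton of the idea: the restart cancels the token that the in-progress startup observes, so that startup unwinds itself instead of being waited on.

    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical names -- a sketch of the idea, not a proposed implementation.
    public class CancellableHostManagerSketch
    {
        private CancellationTokenSource _startupCts = new();

        public async Task StartHostAsync()
        {
            var token = _startupCts.Token;

            // ...startup work, passing `token` into every async call...

            // If a transient error forces a restart mid-startup, request it and then
            // observe the token so this code path unwinds instead of deadlocking.
            await RestartHostAsync();
            token.ThrowIfCancellationRequested();
        }

        public Task RestartHostAsync()
        {
            // Cancel the token the current startup is observing and create a fresh one.
            var previous = _startupCts;
            _startupCts = new CancellationTokenSource();
            previous.Cancel();

            // Kick off a fresh startup; the cancelled startup unwinds on its own.
            _ = Task.Run(StartHostAsync);
            return Task.CompletedTask;
        }
    }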

@brettsam requested a review from a team as a code owner June 20, 2025 15:12
@brettsam (Member, Author) commented:

@cjaliaga / @mathewc / @RohitRanjanMS -- this is ready for another look after a refactor to instead rely on the host lifetime to check whether it's completed startup or not
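
A minimal sketch of what checking the host lifetime for completed startup could look like, assuming a signal taken from IHostApplicationLifetime.ApplicationStarted; the names and wiring are assumptions for illustration, not the actual diff:

    using System.Threading.Tasks;
    using Microsoft.Extensions.Hosting;

    // A sketch under assumptions: the restart decision consults the host lifetime
    // to see whether startup has already completed. Not the actual change in this PR.
    public class MetadataProviderSketch
    {
        private readonly IHostApplicationLifetime _lifetime;          // real .NET hosting abstraction
        private readonly IScriptHostManagerSketch _scriptHostManager; // hypothetical stand-in

        public MetadataProviderSketch(IHostApplicationLifetime lifetime, IScriptHostManagerSketch scriptHostManager)
        {
            _lifetime = lifetime;
            _scriptHostManager = scriptHostManager;
        }

        public async Task EnsureWorkerChannelsAsync()
        {
            if (_lifetime.ApplicationStarted.IsCancellationRequested)
            {
                // Startup has already completed, so a restart is safe.
                await _scriptHostManager.RestartHostAsync();
            }
            // Otherwise we are still inside startup: restarting now would wait on
            // that same startup and deadlock, so let the startup keep retrying instead.
        }
    }

    public interface IScriptHostManagerSketch
    {
        Task RestartHostAsync();
    }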

@RohitRanjanMS self-requested a review June 30, 2025 17:46
@brettsam (Member, Author) commented:

@mathewc -- merging, but will follow up if you have any further suggestions

@brettsam merged commit 0e0cadb into dev Jun 30, 2025
9 checks passed
@brettsam deleted the brettsam/race branch June 30, 2025 20:14
Development

Successfully merging this pull request may close these issues.

Error during host startup can cause a deadlock in the restart flow, leaving the host unhealthy until a manual restart
5 participants