fixing startup deadlock #11142

Merged: 4 commits into dev from brettsam/race on Jun 30, 2025

Conversation

@brettsam (Member) commented Jun 20, 2025

Issue describing the changes in this PR

resolves #10766

Pull request checklist

IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.

  • Backporting to the in-proc branch is not required
    • Otherwise: Link to backporting PR
  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • My changes do not require diagnostic events changes
    • Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
  • I have added all required tests (Unit tests, E2E tests)

Additional information

The problem here (outlined in the issue) is that:

  1. Even in transient error situations, we were setting the State to Error.
  2. If we tried to recover, the WorkerFunctionMetadataProvider would look at that state, think that the host was in a stable state without any channels, and issue a restart here:
    await _scriptHostManager.RestartHostAsync();
  3. That restart would deadlock waiting for the current startup to complete... but it was part of the current startup so it'd never complete.
  4. Deadlock (see the sketch after this list).
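
To make the cycle concrete, here is a minimal sketch of the shape of the deadlock. All names are hypothetical and only illustrate the pattern; this is not the actual host code.

    using System.Threading.Tasks;

    // Hypothetical names -- a sketch of the deadlock pattern, not the real host code.
    public class HostManagerSketch
    {
        private readonly TaskCompletionSource _startupCompleted = new();

        public async Task StartHostAsync()
        {
            // Worker channel setup hits a transient error, the state is set to Error,
            // and the metadata provider reacts by requesting a restart from *within* startup:
            await RestartHostAsync();

            _startupCompleted.SetResult(); // never reached
        }

        public async Task RestartHostAsync()
        {
            // The restart waits for the current startup to finish before tearing down...
            // ...but the current startup is our caller, so this await never completes.
            await _startupCompleted.Task;
        }
    }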

After looking through all the code, I've concluded that:

  • Error and Offline (and to some extent Running) are considered "terminal" states... that's the state things will remain in. We have code that immediately returns if it sees Error, etc.
  • However, during a transient error, this isn't the case. We are recovering and need to communicate that somehow.
  • In this specific case, it's untrue that the ScriptHost is in an Error state. We are retrying fresh, so the metadata manager needs to know that.
  • By resetting to Default, it communicates that we're still starting up.
  • Yes, this may go on forever as we'll back off but retry until the platform stops us. However, the places that are waiting on this also have timeouts applied, so they should be unaffected. And in fact, they should behave better under transient conditions.

My original attempt added a HandlingHostError state that, while it did communicate correctly, changed the assumptions across other services. A handful of tests failed.

After seeing tests fail because they assumed Error, I changed the approach to simply start worker channels while in the Error state as well (illustrated below). This seems fine the more I look at it -- we are performing a normal startup (or else these services wouldn't start), so we should behave as such. Every other service behaves normally here; we just happen to have this unique piece of code for an odd situation where we need to recover.
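
Roughly, the change amounts to treating Error, during startup, as a state in which worker channels may still be started. The sketch below is only illustrative -- the enum values and guard are hypothetical stand-ins, not the actual implementation:

    // Hypothetical state enum and guard -- illustrative only, not the actual host code.
    enum HostState { Default, Initialized, Running, Error, Offline }

    static bool ShouldStartWorkerChannels(HostState state) =>
        state is HostState.Default or HostState.Initialized
        // Newly allowed: a transient Error during startup means we're recovering,
        // so worker channels should start just as they would in a normal startup.
        or HostState.Error;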

Other considerations:

  • What we really should do is make sure that when RestartAsync() is called, the CancellationToken we pass flows all the way down, effectively cancelling itself. However, that would be a large refactoring of code that is already being completely rewritten. I consider this a strategic fix to stop the issues we're currently seeing; the next iteration of worker management will take care of this scenario in a cleaner way.
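
For reference, "flowing the CancellationToken all the way down" would look roughly like the sketch below -- hypothetical names, and only the skeleton of the idea: the restart cancels the token that the in-progress startup observes, so that startup unwinds itself instead of being waited on.

    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical names -- a sketch of the idea, not a proposed implementation.
    public class CancellableHostManagerSketch
    {
        private CancellationTokenSource _startupCts = new();

        public async Task StartHostAsync()
        {
            var token = _startupCts.Token;

            // ...startup work, passing `token` into every async call...

            // If a transient error forces a restart mid-startup, request it and then
            // observe the token so this code path unwinds instead of deadlocking.
            await RestartHostAsync();
            token.ThrowIfCancellationRequested();
        }

        public Task RestartHostAsync()
        {
            // Cancel the token the current startup is observing and create a fresh one.
            var previous = _startupCts;
            _startupCts = new CancellationTokenSource();
            previous.Cancel();

            // Kick off a fresh startup; the cancelled startup unwinds on its own.
            _ = Task.Run(StartHostAsync);
            return Task.CompletedTask;
        }
    }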

@brettsam requested a review from a team as a code owner June 20, 2025 15:12
@brettsam (Member, Author) commented:

@cjaliaga / @mathewc / @RohitRanjanMS -- this is ready for another look after a refactor to instead rely on the host lifetime to check whether it's completed startup or not
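
A minimal sketch of what checking the host lifetime for completed startup could look like, assuming a signal taken from IHostApplicationLifetime.ApplicationStarted; the names and wiring are assumptions for illustration, not the actual diff:

    using System.Threading.Tasks;
    using Microsoft.Extensions.Hosting;

    // A sketch under assumptions: the restart decision consults the host lifetime
    // to see whether startup has already completed. Not the actual change in this PR.
    public class MetadataProviderSketch
    {
        private readonly IHostApplicationLifetime _lifetime;          // real .NET hosting abstraction
        private readonly IScriptHostManagerSketch _scriptHostManager; // hypothetical stand-in

        public MetadataProviderSketch(IHostApplicationLifetime lifetime, IScriptHostManagerSketch scriptHostManager)
        {
            _lifetime = lifetime;
            _scriptHostManager = scriptHostManager;
        }

        public async Task EnsureWorkerChannelsAsync()
        {
            if (_lifetime.ApplicationStarted.IsCancellationRequested)
            {
                // Startup has already completed, so a restart is safe.
                await _scriptHostManager.RestartHostAsync();
            }
            // Otherwise we are still inside startup: restarting now would wait on
            // that same startup and deadlock, so let the startup keep retrying instead.
        }
    }

    public interface IScriptHostManagerSketch
    {
        Task RestartHostAsync();
    }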

@RohitRanjanMS self-requested a review June 30, 2025 17:46
@brettsam (Member, Author) commented:

@mathewc -- merging, but will follow up if you have any further suggestions

@brettsam merged commit 0e0cadb into dev Jun 30, 2025
9 checks passed
@brettsam deleted the brettsam/race branch June 30, 2025 20:14
Development

Successfully merging this pull request may close these issues.

Error during host startup can cause a deadlock in the restart flow, leaving the host unhealthy until a manual restart
5 participants