Issue describing the changes in this PR
resolves #10766
Pull request checklist
- IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.
- Backporting to the in-proc branch is not required for this change.
- release_notes.md
Additional information
The problem here (outlined in the issue) is that:

- The host state transitions to Error during transient error handling.
- WorkerFunctionMetadataProvider would look at that state, think that the host was in a stable state without any channels, and issue a restart here: azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs, line 99 (commit 7173879). See the sketch below.
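To make the failure mode concrete, here is a minimal, self-contained sketch of that check; every type and member name below is an illustrative stand-in, not the actual code at that line:

```csharp
// Hedged sketch of the failure mode: during transient error handling the host
// state flips to Error; a metadata provider that treats Error as a settled,
// terminal state and sees no worker channels concludes the channels need to be
// (re)started and issues a restart, stepping on the recovery already in flight.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

enum HostState { Default, Initializing, Running, Error, Offline }

class MetadataProviderSketch
{
    private readonly Func<HostState> _getHostState;                  // assumed accessor for the host's state
    private readonly Func<IReadOnlyCollection<object>> _getChannels; // assumed accessor for active worker channels
    private readonly Func<CancellationToken, Task> _restartChannels; // assumed restart hook

    public MetadataProviderSketch(
        Func<HostState> getHostState,
        Func<IReadOnlyCollection<object>> getChannels,
        Func<CancellationToken, Task> restartChannels)
    {
        _getHostState = getHostState;
        _getChannels = getChannels;
        _restartChannels = restartChannels;
    }

    public async Task EnsureWorkerChannelsAsync(CancellationToken token)
    {
        // Error/Offline (and to some extent Running) are treated as "terminal":
        // the state things will remain in. So Error plus zero channels looks
        // like a stable host that simply has no channels...
        HostState state = _getHostState();
        bool looksSettled = state == HostState.Error || state == HostState.Offline || state == HostState.Running;

        if (looksSettled && _getChannels().Count == 0)
        {
            // ...which triggers a restart even though the host is mid-recovery.
            await _restartChannels(token);
        }
    }
}
```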
After looking through all the code, I've concluded that:

- Error and Offline (and to some extent Running) are considered "terminal" events; that's the state things will remain in. We have code that immediately returns if it sees Error, etc.
- However, during a transient error this isn't the case. We are recovering and need to communicate that somehow.
- In this specific case, it's untrue that the ScriptHost is in an Error state. We are retrying fresh, so the metadata manager needs to know that. By resetting to Default, it communicates that we're still starting up.
- Yes, this may go on forever, as we'll back off but retry until the platform stops us. However, the places that are waiting on this also have timeouts applied, so they should be unaffected. In fact, they should behave better under transient conditions.
- My original attempt added a HandlingHostError state that, while it did communicate correctly, changed the assumptions across other services; a handful of tests failed.
- After seeing tests fail because they assumed Error, I've changed the approach to simply starting worker channels while in the Error state as well (see the sketch after this list). This seems fine the more I look at it: we are performing a normal startup (or else these services wouldn't start), so we should behave as such. Every other service behaves normally here; we just happen to have this unique piece of code for an odd situation where we need to recover.
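Roughly, the final approach amounts to something like the following sketch, assuming a hypothetical startup path that previously short-circuited on Error; none of these names are the host's actual members:

```csharp
// Hedged sketch of the approach: worker channel startup proceeds even when the
// observed host state is Error, because during transient-error recovery a normal
// startup really is underway. Names are illustrative, not the host's API.
using System.Threading;
using System.Threading.Tasks;

enum HostState { Default, Initializing, Running, Error, Offline }

static class WorkerChannelStartupSketch
{
    // Before: Error was treated as terminal and startup returned immediately.
    // After: only a genuinely terminal Offline state short-circuits; Error does not.
    static bool ShouldStartChannels(HostState state) => state != HostState.Offline;

    public static async Task StartChannelsIfNeededAsync(HostState state, CancellationToken token)
    {
        if (!ShouldStartChannels(state))
        {
            return; // host is actually shut down; nothing to start
        }

        // Perform the same startup every other service performs here; the host is
        // retrying fresh, so behaving like a normal startup is correct.
        await StartChannelsAsync(token);
    }

    private static Task StartChannelsAsync(CancellationToken token) => Task.Delay(10, token); // stand-in for real channel startup
}
```

Other considerations: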
- When RestartAsync() is called, the CancellationToken we call with should flow all the way down, effectively cancelling itself. However, that would be a large refactoring of code that is already being completely rewritten. I consider this a strategic fix to stop the issues we're currently seeing; the next iteration of worker management will take care of this scenario in a cleaner way.
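For illustration only, the kind of token plumbing that refactoring could introduce might look like the sketch below, where each restart cancels the one already in flight via a linked CancellationTokenSource. This is a hypothetical type, not the host's implementation:

```csharp
// Hedged sketch of flowing the token "all the way down": a new RestartAsync call
// cancels any restart already in progress by rotating a linked token source.
using System.Threading;
using System.Threading.Tasks;

class RestartCoordinatorSketch
{
    private readonly object _gate = new object();
    private CancellationTokenSource _inFlight = new CancellationTokenSource();

    public Task RestartAsync(CancellationToken callerToken)
    {
        CancellationTokenSource current;
        lock (_gate)
        {
            // A new restart request cancels the restart already in flight.
            _inFlight.Cancel();
            // (Disposal of the old source is omitted in this sketch.)

            // Link to the caller's token so either the caller or a newer restart
            // can cancel this attempt as the token flows down the restart path.
            _inFlight = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
            current = _inFlight;
        }

        return DoRestartAsync(current.Token);
    }

    private static Task DoRestartAsync(CancellationToken token) => Task.Delay(100, token); // stand-in for the real restart work
}
```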