Terminate/forget timed out activities or sub-orchestrations #3209
Unanswered
firedigger asked this question in Q&A
Replies: 0 comments
We run a background process as an eternal orchestration that performs a task (a backup) for every user in a list.
The number of users can be large (up to tens of thousands).
To reduce the number of replays and control concurrency, we process users in batches, awaiting each batch with `Task.WhenAll` in a loop. However, a single user's backup can in theory get stuck and halt the whole orchestration (for days).
I considered using `Task.WhenAny` with a timer so that the orchestration could proceed and skip work items that timed out. In practice, that didn't seem to help: I still got stuck orchestrations. I have also tried putting a timeout on a sub-orchestrator as a whole, but I think that is explicitly not supposed to work (the orchestration will wait for its sub-orchestrations). And, if I understand correctly, if I set `functionTimeout` to a value (currently I run with `-1` on an App Service plan), it will nuke the orchestration.

So my main question: is there a reasonable way to implement the expected behaviour? I assumed it would be possible within the rich scheduling framework of Azure Functions, but if not, I would have to implement it in the activity functions themselves (though that might not be very reliable if the code there really gets stuck).
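A minimal sketch of the `Task.WhenAny`-with-timer idea described above, written as plain .NET so it can be run outside a function app. The class name `BatchTimeout` and the method `CompletedWithinAsync` are hypothetical names chosen for illustration, not part of any SDK. It only shows the "wait for the batch or the timer, then keep whatever finished" logic; it is not a confirmed fix for the stuck orchestrations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper (not an SDK type): illustrates the batch-with-timeout pattern.
public static class BatchTimeout
{
    // Waits until every task in the batch has finished or the timeout task
    // completes, whichever happens first, then returns the results of the
    // tasks that ran to completion. Unfinished or faulted tasks are skipped;
    // nothing here cancels or terminates them.
    public static async Task<IReadOnlyList<T>> CompletedWithinAsync<T>(
        IReadOnlyList<Task<T>> batch, Task timeout)
    {
        // WhenAll completes only once every task in the batch has completed.
        Task all = Task.WhenAll(batch);

        // WhenAny returns as soon as either the whole batch or the timeout is done.
        // It does not throw, even if the batch task faulted.
        await Task.WhenAny(all, timeout);

        return batch
            .Where(t => t.Status == TaskStatus.RanToCompletion)
            .Select(t => t.Result)
            .ToList();
    }
}
```

Inside an orchestrator, the assumption is that `batch` would hold the tasks returned by `context.CallActivityAsync<T>(...)` and `timeout` would be a durable timer from `context.CreateTimer(...)` rather than `Task.Delay`. Two caveats that, as far as I know, the Durable Functions timer documentation calls out: the timer should be created with a `CancellationToken` that is cancelled once the batch finishes, because an outstanding timer can keep the orchestration from completing; and timing out only makes the orchestrator stop waiting, while the activity itself keeps running.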
My secondary question: with the introduction of the `extendedSessions` parameter for isolated functions (which I use), will the `Task.WhenAny` approach to batches work well instead of `Task.WhenAll`? Previously it was very slow on large user lists because of replays (which spend time reading the big user list from blob storage). If I can avoid those frequent replays, I could keep a constant number of user backups running simultaneously, and a stuck user would at least not halt the progress of the batches. It would still be great if I could at least restart the orchestration in its entirety in that case. I have also been considering a separate `TimerTrigger` function that finds "stuck" orchestrations and restarts them (terminate + schedule). Of course, I would prefer something more native to the orchestration flow.

Sorry for the loaded question; I appreciate any advice on this, including `host.json` settings I should look into. I have tried quite a few, but found that changing `maxConcurrentActivityFunctions` was not a good idea, as the default is per core and reasonable. I imagined that fanning out over a big list of items was a very common Azure Functions scenario, but replay on every `await`, its interaction with the concurrency mechanisms, and reading large blobs for the persisted payload all hurt performance, as I have thousands of tenants to process. (I auto-scale App Service plan cores based on CPU and message queue length, so hardware shouldn't be an issue.)