Terminate/forget timed out activities or sub-orchestrations #3209

firedigger · 2025-09-29T08:29:18Z

We have a background process, implemented as an eternal orchestration, that performs a task (a backup) for every user in a list.
The number of users can be large (up to tens of thousands).
To reduce the number of replays and control concurrency, we process users in batches, awaiting each batch with Task.WhenAll in a loop.
However, a single user's backup can in theory get stuck and halt the whole orchestration (for days).
I considered using Task.WhenAny with a timer so that the orchestration could proceed and skip work items that timed out. In practice that did not seem to help: I still got stuck orchestrations. I also tried putting a timeout on a sub-orchestrator as a whole, but I think that is explicitly not supposed to work (the orchestration will wait for its sub-orchestrations).
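To make the intended behaviour concrete, here is a minimal sketch of "run a batch, wait up to a deadline, skip whatever has not finished". It is a plain Python asyncio simulation, not Durable Functions code: backup_user, run_batch and demo_batch are hypothetical names, and the sleep durations stand in for real backup work. One difference to keep in mind is that asyncio can actually cancel a pending task, whereas a Durable Functions orchestrator that stops awaiting an activity only ignores its result; the activity itself keeps running.

```python
import asyncio


async def backup_user(user_id: str, duration: float) -> str:
    """Hypothetical stand-in for one backup activity."""
    await asyncio.sleep(duration)
    return user_id


async def run_batch(work: dict[str, float], timeout: float) -> tuple[list[str], list[str]]:
    """Run one batch and return (finished, timed_out) user ids."""
    # Start every backup in the batch and remember which task belongs to which user.
    tasks = {asyncio.create_task(backup_user(u, d)): u for u, d in work.items()}

    # Wait for the whole batch, but no longer than `timeout` seconds.
    done, pending = await asyncio.wait(tasks, timeout=timeout)

    # Give up on stragglers so the caller can move on to the next batch.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    finished = sorted(tasks[t] for t in done)
    timed_out = sorted(tasks[t] for t in pending)
    return finished, timed_out


def demo_batch() -> tuple[list[str], list[str]]:
    # "stuck" takes far longer than the batch timeout and is skipped.
    work = {"alice": 0.01, "bob": 0.02, "stuck": 5.0}
    return asyncio.run(run_batch(work, timeout=0.3))
```

The timed-out ids are returned rather than dropped, so they could be logged or retried in a later run.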
And, if I understand correctly, if I set functionTimeout to a finite value (currently it is -1, since I run on an App Service plan), it will kill the orchestration.
So my main question is: is there a reasonable way to implement the expected behaviour? I assumed it would be possible within the rich scheduling framework of Azure Functions, but if not, I will have to implement the timeout inside the activity functions themselves (though that might not be very reliable if the code there really gets stuck).
My secondary question: with the introduction of the extendedSessions parameter for isolated functions (which I use), will using Task.WhenAny instead of Task.WhenAll for batches work well? Previously it would be very slow on large user lists because of replays (each of which spends time reading the big user list from blob storage). If I can avoid those frequent replays, I could keep a constant number of user backups running simultaneously, so a stuck user would at least not halt the progress of the others. It would still be great if I could at least restart the orchestration in its entirety in that case. I have also been considering a separate TimerTrigger that finds "stuck" orchestrations and restarts them (terminate + schedule). Of course, I would prefer something more native to the orchestration flow.
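The "constantly running" variant could look roughly like the sliding window below: keep a fixed number of backups in flight, start a new one whenever one finishes, and abandon any item that exceeds its own deadline. Again this is only a plain asyncio simulation under assumed names (run_window, max_in_flight, per_item_timeout are all hypothetical). Inside a real orchestrator the deadlines would have to come from durable timers and the orchestration's own clock rather than loop.time(), because orchestrator code must be deterministic.

```python
import asyncio


async def backup_user(user_id: str, duration: float) -> str:
    """Hypothetical stand-in for one backup activity."""
    await asyncio.sleep(duration)
    return user_id


async def run_window(work: dict[str, float], max_in_flight: int,
                     per_item_timeout: float) -> tuple[list[str], list[str]]:
    """Process all users with bounded concurrency and a per-user deadline."""
    loop = asyncio.get_running_loop()
    queue = list(work.items())
    in_flight = {}   # task -> (user_id, deadline)
    cancelled = []   # abandoned tasks, awaited once at the end
    finished, abandoned = [], []

    while queue or in_flight:
        # Top the window up to the concurrency limit.
        while queue and len(in_flight) < max_in_flight:
            user_id, duration = queue.pop(0)
            task = asyncio.create_task(backup_user(user_id, duration))
            in_flight[task] = (user_id, loop.time() + per_item_timeout)

        # Sleep until something finishes or the earliest deadline passes.
        next_deadline = min(deadline for _, deadline in in_flight.values())
        wait_for = max(0.0, next_deadline - loop.time())
        done, _ = await asyncio.wait(
            in_flight, timeout=wait_for, return_when=asyncio.FIRST_COMPLETED)

        for task in done:
            finished.append(in_flight.pop(task)[0])

        # Abandon every item whose deadline has passed, freeing its slot.
        now = loop.time()
        for task, (user_id, deadline) in list(in_flight.items()):
            if deadline <= now:
                task.cancel()
                cancelled.append(task)
                del in_flight[task]
                abandoned.append(user_id)

    await asyncio.gather(*cancelled, return_exceptions=True)
    return finished, abandoned


def demo_window() -> tuple[list[str], list[str]]:
    work = {"a": 0.01, "stuck": 5.0, "b": 0.01, "c": 0.01}
    return asyncio.run(run_window(work, max_in_flight=2, per_item_timeout=0.3))
```

In the demo the stuck item occupies one of the two slots until its deadline, while the other slot keeps draining the queue, so the remaining users are not held up.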
Sorry for the loaded question; I appreciate any advice on this, including host.json settings I should look into. I have tried quite a few, but found that changing maxConcurrentActivityFunctions was not a good idea, as the default is per core and reasonable. I imagined that fanning out over a big list of items was a very common Azure Functions scenario, but the interaction of replay-on-await with the concurrency mechanisms, plus reading large blobs for the persisted payload, is hurting performance, as I have thousands of tenants to process. (I auto-scale the App Service plan cores based on CPU and message queue length, so hardware shouldn't be an issue.)

Replies: 0 comments

Select a reply

Uh oh!

Terminate/forget timed out activities or sub-orchestrations #3209

Uh oh!

Uh oh!

firedigger Sep 29, 2025

Replies: 0 comments

firedigger
Sep 29, 2025