Orchestration is stuck in the Running state #3022

Open
andrey-malkov opened this issue Jan 31, 2025 · 4 comments


andrey-malkov commented Jan 31, 2025

I've already gone through the troubleshooting guide, but it doesn't provide a solution for our specific case.

Our workflow scans the backlog table and initiates sub-orchestrator workflows to process the backlog items. To keep iterations efficient, we use the eternal orchestration pattern with ContinueAsNew, which restarts the instance with a fresh history and avoids the unbounded history growth of an infinite loop. Additionally, we introduce a delay between iterations using the IDurableOrchestrationContext.CreateTimer method:

// Durable timer: delay the next iteration by one minute
using var pullingJobCts = new CancellationTokenSource();
await context.CreateTimer(context.CurrentUtcDateTime.Add(TimeSpan.FromMinutes(1)), pullingJobCts.Token);
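
For reference, here is a minimal sketch of what such an eternal polling orchestrator looks like in the in-process Durable Functions programming model. The function, activity, and sub-orchestrator names ("BacklogPollingOrchestrator", "GetBacklogItems", "ProcessBacklogItem") are placeholders for illustration, not our real ones:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class BacklogPollingOrchestration
{
    [FunctionName("BacklogPollingOrchestrator")]
    public static async Task Run(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // Scan the backlog and start a sub-orchestrator per item
        // (activity and sub-orchestrator names are placeholders).
        var backlogItems = await context.CallActivityAsync<string[]>("GetBacklogItems", null);
        foreach (var item in backlogItems)
        {
            await context.CallSubOrchestratorAsync("ProcessBacklogItem", item);
        }

        // Delay the next iteration with a durable timer (as in the snippet above).
        using var pullingJobCts = new CancellationTokenSource();
        await context.CreateTimer(context.CurrentUtcDateTime.Add(TimeSpan.FromMinutes(1)), pullingJobCts.Token);

        // Restart the orchestration with a clean history instead of looping forever.
        context.ContinueAsNew(null);
    }
}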

We've been experiencing this issue for quite some time. Over the past two weeks, out of thousands of executions, approximately a dozen workflow instances have become stuck. Using the VS Code extension, I can see that the last recorded operation for these stuck workflows is TimerCreated.

I was finally able to capture the messages for these instances in the control queue:

Instance ID                              Message ID                              Dequeue count
38bf0b8588db420e8c5b992d47a8e735:1705    b7bde336-9125-4859-8dd7-d259dd0d4204    4549
58dbd6970a4e4d8895a03af0a8fa1fad:998     bba78a46-7222-483a-87e1-74f76d0e17df    6944

Their dequeue count values suggest that the messages expected to resume computation after the delay have never been processed successfully; the instances have been stuck since Jan 17 and Jan 22 respectively. I've attached the message bodies. You can see the event type is TimerFiredEvent.

b7bde336-9125-4859-8dd7-d259dd0d4204.json
bba78a46-7222-483a-87e1-74f76d0e17df.json

In the logs I found these errors:

2025-01-31T21:45:32Z [Error] An unexpected failure occurred while processing instance '58dbd6970a4e4d8895a03af0a8fa1fad:998': DurableTask.AzureStorage.Storage.DurableTaskStorageException: An error occurred while communicating with Azure Storage
---> Azure.RequestFailedException: The specified blob does not exist.

error-38bf0b8588db420e8c5b992d47a8e735-1705.txt
error-58dbd6970a4e4d8895a03af0a8fa1fad-998.txt

Thanks,
Andrei


AnatoliB (Collaborator) commented Feb 3, 2025

Seems to be related to Azure/durabletask#802

AnatoliB self-assigned this Feb 3, 2025
AnatoliB added the P1 (Priority 1) label Feb 3, 2025

AnatoliB (Collaborator) commented Feb 3, 2025

@andrey-malkov While we investigate the root cause, would you be able to apply the workaround mentioned at Azure/durabletask#802 (comment)?

We solved our problem by replacing our calls to context.ContinueAsNew with context.StartNewOrchestration, after which no new failing instances were reported.
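
For illustration, a minimal sketch of that change, assuming the in-process IDurableOrchestrationContext API; the orchestrator name below is a placeholder, not taken from your code:

// Before: restart the current instance with a clean history.
// context.ContinueAsNew(null);

// After (workaround from Azure/durabletask#802): schedule a brand-new
// instance (fire-and-forget) and let the current one complete.
// "BacklogPollingOrchestrator" is a placeholder for your orchestrator's name.
context.StartNewOrchestration("BacklogPollingOrchestrator", null);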


andrey-malkov (Author) commented Feb 3, 2025

@AnatoliB Yes, I can try this approach, thank you.

We have two stuck instances in the green stage, and today we got 5 new ones in the active blue stage. Please let me know if you need the app names; you may be able to collect some additional information about the issue.

andrey-malkov (Author) commented

@AnatoliB I can't switch the implementation to StartNewOrchestration; by design we need ContinueAsNew.
