Description
Expected Behavior
The scavenger should not garbage-collect the history of a workflow whose mutable state still exists in the database.
Actual Behavior
I have a case where a workflow got stuck, and retrieving it through the Temporal Web UI shows "workflow execution history not found".
Looking at the logs, I found a DataLoss error (non-contiguous event branch) logged by the Frontend server.
I then found a tombstone in the DB for the record whose tree_id is the workflow's runID, which indicates that Temporal executed a deletion against this workflow's history.
The Temporal log also shows a "deleting history garbage" entry whose timestamp matches the time of the tombstone.
At this point, it is clear that Temporal executed a history cleanup against a running workflow. Please note that we can still read the mutable state from the DB at this time (2 records: executions and current executions); only the history is lost.
I reviewed the component that performed the deletion, the history scavenger, and found that two safeguards must both be satisfied before a deletion is executed:
- the history branch fork time is at least `historyScannerMinDataAge` old
- describe mutable state returns either `serviceerror.NotFound` or `serviceerror.NamespaceNotFound`.
Safeguard #1 is satisfied because our workflow is quite old.
Safeguard #2 is where I think the cause lies. Another key point is that we also have the execution data cleaner enabled. Since we can still read the mutable state, I suspect the problem is somewhere in the namespace registry. I reviewed the namespace registry and found that this line can be problematic: it swallows any error returned from persistence and converts it to a NamespaceNotFound error, which causes safeguard #2 to be satisfied as well.
Steps to Reproduce the Problem
Specifications
- Version: 1.22.0, but I think the latest code still has this issue.
- Platform: