-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Description
Self Checks
- I have read the Contributing Guide and Language Policy.
- I have searched for existing issues search for existing issues, including closed ones.
- I confirm that I am using English to submit this report, otherwise it will be closed.
- Please do not modify this template :) and fill in all the required fields.
1. Is this request related to a challenge you're experiencing? Tell me about your story.
The current GraphExecution class uses a scalar pause_reason field to record why an execution is paused. This design has a limitation: in a multi-threaded environment where multiple worker threads execute nodes concurrently, different threads can generate NodeRunPauseRequestedEvent events independently.
Between the time the graph engine receives the first NodeRunPauseRequestedEvent and when all worker threads are stopped, additional threads may generate different pause events. The current scalar field design cannot handle multiple concurrent pause reasons, leading to potential loss of important pause event information.
The following diagram shows a potential scenario:
sequenceDiagram
participant Worker1
participant Worker2
participant GraphEngine
participant GraphExecution
Note over Worker1,Worker2: Multiple workers executing nodes concurrently
Worker1->>GraphEngine: NodeRunPauseRequestedEvent(reason1)
GraphEngine->>GraphExecution: pause(reason1)
GraphExecution->>GraphExecution: pause_reason = reason1
Note over GraphEngine: Before all workers stopped
Worker2->>GraphEngine: NodeRunPauseRequestedEvent(reason2)
GraphEngine->>GraphExecution: pause(reason2)
GraphExecution->>GraphExecution: pause_reason = reason2
Note over GraphExecution: Loss of reason1
To address this issue, we propose the following changes:
-
Convert
pause_reasontopause_reasons: Change the field from a scalar value to a list that can contain multiplePauseReasonobjects, ensuring all pause events are properly captured. -
Update related components: Modify the following components to support multiple pause reasons:
GraphExecutionState- Update the serialization model to include the new list fieldGraphExecutionProtocol- Update the protocol interface to reflect the new structureGraphRunPausedEvent- Update the event to carry multiple pause reasons
Besides, this PR also introduce a new pause_metadata field to the WorkflowPause model to store structured information about specific pause events. This field:
- Uses a JSON-serialized format to store pause-related metadata
- Has a maximum size limit of 64KB to prevent excessive database storage
- Enables better tracking and debugging of pause scenarios
2. Additional context or comments
No response
3. Can you help us with this feature?
- I am interested in contributing to this feature.