Skip to content

Enhanced GraphEngine Pause Handling #28195

@QuantumGhost

Description

@QuantumGhost

Self Checks

  • I have read the Contributing Guide and Language Policy.
  • I have searched for existing issues search for existing issues, including closed ones.
  • I confirm that I am using English to submit this report, otherwise it will be closed.
  • Please do not modify this template :) and fill in all the required fields.

1. Is this request related to a challenge you're experiencing? Tell me about your story.

The current GraphExecution class uses a scalar pause_reason field to record why an execution is paused. This design has a limitation: in a multi-threaded environment where multiple worker threads execute nodes concurrently, different threads can generate NodeRunPauseRequestedEvent events independently.

Between the time the graph engine receives the first NodeRunPauseRequestedEvent and when all worker threads are stopped, additional threads may generate different pause events. The current scalar field design cannot handle multiple concurrent pause reasons, leading to potential loss of important pause event information.

The following diagram shows a potential scenario:

sequenceDiagram
    participant Worker1
    participant Worker2
    participant GraphEngine
    participant GraphExecution

    Note over Worker1,Worker2: Multiple workers executing nodes concurrently
    
    Worker1->>GraphEngine: NodeRunPauseRequestedEvent(reason1)
    GraphEngine->>GraphExecution: pause(reason1)
    GraphExecution->>GraphExecution: pause_reason = reason1
    
    Note over GraphEngine: Before all workers stopped
    
    Worker2->>GraphEngine: NodeRunPauseRequestedEvent(reason2)
    GraphEngine->>GraphExecution: pause(reason2)
    GraphExecution->>GraphExecution: pause_reason = reason2
    
    Note over GraphExecution: Loss of reason1
Loading

To address this issue, we propose the following changes:

  1. Convert pause_reason to pause_reasons: Change the field from a scalar value to a list that can contain multiple PauseReason objects, ensuring all pause events are properly captured.

  2. Update related components: Modify the following components to support multiple pause reasons:

    • GraphExecutionState - Update the serialization model to include the new list field
    • GraphExecutionProtocol - Update the protocol interface to reflect the new structure
    • GraphRunPausedEvent - Update the event to carry multiple pause reasons

Besides, this PR also introduce a new pause_metadata field to the WorkflowPause model to store structured information about specific pause events. This field:

  • Uses a JSON-serialized format to store pause-related metadata
  • Has a maximum size limit of 64KB to prevent excessive database storage
  • Enables better tracking and debugging of pause scenarios

2. Additional context or comments

No response

3. Can you help us with this feature?

  • I am interested in contributing to this feature.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions