
Fix: Eliminate Orchestrator Orphaned Jobs #737

@0xvasanth

Description

Is there an existing issue?

  • I have searched the existing issues

Motivation

The Madara Orchestrator has a critical production reliability flaw: jobs become permanently orphaned when a worker crashes while processing them. This happens because JobStatus::LockedForProcessing is stored in the database for concurrency control, mixing business state with the locking mechanism.

Current Impact:

  • Jobs stuck in LockedForProcessing status never get processed again
  • Manual operations team intervention required to recover orphaned jobs
  • No automatic detection or recovery mechanism
  • Business continuity risk for settlement and proving pipelines
  • Production outages when critical state transitions get stuck

```mermaid
graph TD
    A[Job: Created] --> B[Worker Starts Processing]
    B --> C[Status: LockedForProcessing]
    C --> D{Worker Crashes?}
    D -->|No| E[Process Successfully]
    D -->|Yes| F[🚨 ORPHANED JOB]
    E --> G[Status: PendingVerification]
    F --> H[Job Stuck Forever - Manual Fix Required]

    style F fill:#ff6b6b
    style H fill:#ff6b6b
```

What led to this issue:
Workers crash after setting job status to LockedForProcessing but before completing the job, leaving it permanently orphaned in the database. The system has no self-healing capability to recover from this state.
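
For illustration, a minimal sketch of the vulnerable pattern described above; the names here (Db, Job, process_job, do_work) are hypothetical stand-ins, not the orchestrator's actual types:

```rust
// Hypothetical sketch of status-based locking; not the real orchestrator API.
enum JobStatus {
    Created,
    LockedForProcessing,
    PendingVerification,
}

struct Job {
    id: u64,
    status: JobStatus,
}

trait Db {
    fn update(&mut self, job: &Job) -> Result<(), String>;
}

fn do_work(_job: &Job) -> Result<(), String> {
    Ok(())
}

fn process_job(db: &mut dyn Db, mut job: Job) -> Result<(), String> {
    // Step 1: "lock" the job by writing a business status to the database.
    job.status = JobStatus::LockedForProcessing;
    db.update(&job)?;

    // If the worker crashes anywhere in here, nothing ever resets the status:
    // the job stays LockedForProcessing forever and no other worker picks it up.
    do_work(&job)?;

    // Step 2: reached only if the worker survived the whole run.
    job.status = JobStatus::PendingVerification;
    db.update(&job)
}
```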

Request

Goal: Replace the fragile JobStatus::LockedForProcessing database status with a distributed locking system that provides automatic cleanup and self-healing.

Core Requirements:

  1. Eliminate Orphaned Jobs: Zero jobs should ever be permanently stuck
  2. Self-Healing: Automatic recovery within 60 minutes maximum when workers crash
  3. Clean Architecture: Separate business logic (job states) from concurrency control (locking); see the enum sketch after this list
  4. Production Reliability: No manual intervention required for job recovery
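
As a rough sketch of requirement 3, the locking variant disappears from the job-state enum entirely; the variant names below are taken from the diagrams in this issue and may not match the full JobStatus definition in the codebase:

```rust
// Illustrative only: job statuses describe business progress, nothing else.
enum JobStatus {
    Created,             // ready for work; stays Created while a worker holds the external lock
    PendingVerification, // worker finished, awaiting verification
    // ...remaining business states unchanged
    // LockedForProcessing is gone: locking is no longer a job state.
}
```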

Solution

Approach: Replace database status locking with external distributed locks that automatically expire.

High-Level Solution:

  1. Cache-Based Locking: Use the distributed locking system (prerequisite) with TTL-based automatic cleanup
  2. Status Cleanup: Remove LockedForProcessing and keep jobs in Created status during processing
  3. External Coordination: Handle concurrency control outside the job database using distributed locks
  4. Self-Healing: TTL ensures locks expire automatically, making jobs available for retry

```mermaid
graph TB
    subgraph "New Self-Healing Architecture"
        A[Job: Created] --> B[Acquire Cache Lock]
        B --> C{Lock Acquired?}
        C -->|Yes| D[Process Job - Status Stays Created]
        C -->|No| E[Skip - Another Worker Processing]
        D --> F[Status: PendingVerification]
        D --> G[Release Lock]

        subgraph "Automatic Recovery"
            H[Lock TTL: 30 minutes]
            H --> I[Worker Crashes]
            I --> J[Lock Expires Automatically]
            J --> K[Job Available for Retry]
        end
    end

    style B fill:#90EE90
    style H fill:#90EE90
    style J fill:#90EE90
```
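
A minimal sketch of the flow above, assuming a generic lock client with TTL support; the LockClient trait, the job_processing_lock key prefix, and the 30-minute TTL are illustrative assumptions, not the orchestrator's actual distributed-lock API:

```rust
use std::time::Duration;

// Hypothetical lock interface; a real implementation would sit on top of the
// distributed locking prerequisite (e.g. a cache with atomic set-if-absent).
trait LockClient {
    /// Try to take the lock; returns a token on success, None if held elsewhere.
    fn try_acquire(&self, key: &str, ttl: Duration) -> Option<String>;
    /// Best-effort release; if the worker crashes first, the TTL expires the lock instead.
    fn release(&self, key: &str, token: &str);
}

const PROCESSING_LOCK_TTL: Duration = Duration::from_secs(30 * 60); // assumed 30-minute TTL

fn try_process_job(locks: &dyn LockClient, job_id: &str) -> Result<(), String> {
    let key = format!("job_processing_lock:{job_id}");

    // Acquire the external lock instead of writing LockedForProcessing to the DB.
    let Some(token) = locks.try_acquire(&key, PROCESSING_LOCK_TTL) else {
        // Another worker is already on it; skip without touching job status.
        return Ok(());
    };

    // Job status stays Created while we work; only the lock coordinates workers.
    let result = do_work(job_id);

    // On success the job moves to PendingVerification; on a crash before this
    // point, the TTL expires and the job is simply picked up again.
    if result.is_ok() {
        mark_pending_verification(job_id)?;
    }
    locks.release(&key, &token);
    result
}

// Placeholders so the sketch compiles standalone.
fn do_work(_job_id: &str) -> Result<(), String> { Ok(()) }
fn mark_pending_verification(_job_id: &str) -> Result<(), String> { Ok(()) }
```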

Are you willing to help with this request?

Yes!
