
Fix: Eliminate Orchestrator Orphaned Jobs #737

@0xvasanth

Description

Is there an existing issue?

  • I have searched the existing issues

Motivation

The Madara Orchestrator has a critical production reliability flaw: jobs become permanently orphaned when a worker crashes while processing them. This happens because JobStatus::LockedForProcessing is stored in the database for concurrency control, mixing business state with the locking mechanism.

Current Impact:

  • Jobs stuck in LockedForProcessing status never get processed again
  • Manual operations team intervention required to recover orphaned jobs
  • No automatic detection or recovery mechanism
  • Business continuity risk for settlement and proving pipelines
  • Production outages when critical state transitions get stuck

```mermaid
graph TD
    A[Job: Created] --> B[Worker Starts Processing]
    B --> C[Status: LockedForProcessing]
    C --> D{Worker Crashes?}
    D -->|No| E[Process Successfully]
    D -->|Yes| F[🚨 ORPHANED JOB]
    E --> G[Status: PendingVerification]
    F --> H[Job Stuck Forever - Manual Fix Required]

    style F fill:#ff6b6b
    style H fill:#ff6b6b
```

What led to this issue:
Workers crash after setting job status to LockedForProcessing but before completing the job, leaving it permanently orphaned in the database. The system has no self-healing capability to recover from this state.
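
For illustration, a minimal sketch of the vulnerable pattern described above; the names here (Db, Job, process_job, do_work) are hypothetical stand-ins, not the orchestrator's actual types:

```rust
// Hypothetical sketch of status-based locking; not the real orchestrator API.
enum JobStatus {
    Created,
    LockedForProcessing,
    PendingVerification,
}

struct Job {
    id: u64,
    status: JobStatus,
}

trait Db {
    fn update(&mut self, job: &Job) -> Result<(), String>;
}

fn do_work(_job: &Job) -> Result<(), String> {
    Ok(())
}

fn process_job(db: &mut dyn Db, mut job: Job) -> Result<(), String> {
    // Step 1: "lock" the job by writing a business status to the database.
    job.status = JobStatus::LockedForProcessing;
    db.update(&job)?;

    // If the worker crashes anywhere in here, nothing ever resets the status:
    // the job stays LockedForProcessing forever and no other worker picks it up.
    do_work(&job)?;

    // Step 2: reached only if the worker survived the whole run.
    job.status = JobStatus::PendingVerification;
    db.update(&job)
}
```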

Request

Goal: Replace the fragile JobStatus::LockedForProcessing database status with a distributed locking system that provides automatic cleanup and self-healing.

Core Requirements:

  1. Eliminate Orphaned Jobs: Zero jobs should ever be permanently stuck
  2. Self-Healing: Automatic recovery within 60 minutes maximum when workers crash
  3. Clean Architecture: Separate business logic (job states) from concurrency control (locking); see the enum sketch after this list
  4. Production Reliability: No manual intervention required for job recovery
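
As a rough sketch of requirement 3, the locking variant disappears from the job-state enum entirely; the variant names below are taken from the diagrams in this issue and may not match the full JobStatus definition in the codebase:

```rust
// Illustrative only: job statuses describe business progress, nothing else.
enum JobStatus {
    Created,             // ready for work; stays Created while a worker holds the external lock
    PendingVerification, // worker finished, awaiting verification
    // ...remaining business states unchanged
    // LockedForProcessing is gone: locking is no longer a job state.
}
```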

Solution

Approach: Replace database status locking with external distributed locks that automatically expire.

High-Level Solution:

  1. Cache-Based Locking: Use the distributed locking system (prerequisite) with TTL-based automatic cleanup
  2. Status Cleanup: Remove LockedForProcessing and keep jobs in Created status during processing
  3. External Coordination: Handle concurrency control outside the job database using distributed locks
  4. Self-Healing: TTL ensures locks expire automatically, making jobs available for retry

```mermaid
graph TB
    subgraph "New Self-Healing Architecture"
        A[Job: Created] --> B[Acquire Cache Lock]
        B --> C{Lock Acquired?}
        C -->|Yes| D[Process Job - Status Stays Created]
        C -->|No| E[Skip - Another Worker Processing]
        D --> F[Status: PendingVerification]
        D --> G[Release Lock]

        subgraph "Automatic Recovery"
            H[Lock TTL: 30 minutes]
            H --> I[Worker Crashes]
            I --> J[Lock Expires Automatically]
            J --> K[Job Available for Retry]
        end
    end

    style B fill:#90EE90
    style H fill:#90EE90
    style J fill:#90EE90
```
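
A minimal sketch of the flow above, assuming a generic lock client with TTL support; the LockClient trait, the job_processing_lock key prefix, and the 30-minute TTL are illustrative assumptions, not the orchestrator's actual distributed-lock API:

```rust
use std::time::Duration;

// Hypothetical lock interface; a real implementation would sit on top of the
// distributed locking prerequisite (e.g. a cache with atomic set-if-absent).
trait LockClient {
    /// Try to take the lock; returns a token on success, None if held elsewhere.
    fn try_acquire(&self, key: &str, ttl: Duration) -> Option<String>;
    /// Best-effort release; if the worker crashes first, the TTL expires the lock instead.
    fn release(&self, key: &str, token: &str);
}

const PROCESSING_LOCK_TTL: Duration = Duration::from_secs(30 * 60); // assumed 30-minute TTL

fn try_process_job(locks: &dyn LockClient, job_id: &str) -> Result<(), String> {
    let key = format!("job_processing_lock:{job_id}");

    // Acquire the external lock instead of writing LockedForProcessing to the DB.
    let Some(token) = locks.try_acquire(&key, PROCESSING_LOCK_TTL) else {
        // Another worker is already on it; skip without touching job status.
        return Ok(());
    };

    // Job status stays Created while we work; only the lock coordinates workers.
    let result = do_work(job_id);

    // On success the job moves to PendingVerification; on a crash before this
    // point, the TTL expires and the job is simply picked up again.
    if result.is_ok() {
        mark_pending_verification(job_id)?;
    }
    locks.release(&key, &token);
    result
}

// Placeholders so the sketch compiles standalone.
fn do_work(_job_id: &str) -> Result<(), String> { Ok(()) }
fn mark_pending_verification(_job_id: &str) -> Result<(), String> { Ok(()) }
```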

Are you willing to help with this request?

Yes!
