Description
Is there an existing issue?
- I have searched the existing issues
Motivation
The Madara Orchestrator has a critical production vulnerability where jobs become permanently orphaned when workers crash during processing. This happens because we use JobStatus::LockedForProcessing in the database for concurrency control, mixing business logic with locking mechanisms.
Current Impact:
- Jobs stuck in LockedForProcessing status never get processed again
- Manual operations team intervention required to recover orphaned jobs
- No automatic detection or recovery mechanism
- Business continuity risk for settlement and proving pipelines
- Production outages when critical state transitions get stuck
graph TD
A[Job: Created] --> B[Worker Starts Processing]
B --> C[Status: LockedForProcessing]
C --> D{Worker Crashes?}
D -->|No| E[Process Successfully]
D -->|Yes| F[🚨 ORPHANED JOB]
E --> G[Status: PendingVerification]
F --> H[Job Stuck Forever - Manual Fix Required]
style F fill:#ff6b6b
style H fill:#ff6b6b
What led to this issue:
Workers crash after setting job status to LockedForProcessing but before completing the job, leaving it permanently orphaned in the database. The system has no self-healing capability to recover from this state.
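To make the failure mode concrete, here is a minimal sketch of status-based locking. It is a simplified model, not the orchestrator's actual code: the `Job` struct, `claim_next`, and `complete` are hypothetical names, and the job table is just an in-memory slice.

```rust
// Hypothetical, simplified model of status-based locking.
// These are illustrative types, not the orchestrator's real ones.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Created,
    LockedForProcessing,
    PendingVerification,
}

#[derive(Debug)]
struct Job {
    id: u64,
    status: JobStatus,
}

/// Claims the first job still in `Created` by flipping its status in place.
/// The status field doubles as the lock, which is the root of the problem.
fn claim_next(jobs: &mut [Job]) -> Option<u64> {
    let job = jobs.iter_mut().find(|j| j.status == JobStatus::Created)?;
    job.status = JobStatus::LockedForProcessing;
    Some(job.id)
}

/// Marks a claimed job as processed. A crashed worker never reaches this call,
/// so its job stays in `LockedForProcessing` and is never claimed again.
fn complete(jobs: &mut [Job], id: u64) {
    if let Some(job) = jobs.iter_mut().find(|j| j.id == id) {
        job.status = JobStatus::PendingVerification;
    }
}
```

If a worker calls `claim_next` and dies before `complete`, nothing in this model ever moves the job back to `Created`, which matches the orphaned state described above.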
Request
Problem: Replace the vulnerable JobStatus::LockedForProcessing database status with a distributed locking system that provides automatic cleanup and self-healing capabilities.
Core Requirements:
- Eliminate Orphaned Jobs: Zero jobs should ever be permanently stuck
- Self-Healing: Automatic recovery within 60 minutes maximum when workers crash
- Clean Architecture: Separate business logic (job states) from concurrency control (locking)
- Production Reliability: No manual intervention required for job recovery
Solution
Approach: Replace database status locking with external distributed locks that automatically expire.
High-Level Solution:
- Cache-Based Locking: Use the distributed locking system (prerequisite) with TTL-based automatic cleanup
- Status Cleanup: Remove LockedForProcessing and keep jobs in Created status during processing
- External Coordination: Handle concurrency control outside the job database using distributed locks
- Self-Healing: TTL ensures locks expire automatically, making jobs available for retry
graph TB
subgraph "New Self-Healing Architecture"
A[Job: Created] --> B[Acquire Cache Lock]
B --> C{Lock Acquired?}
C -->|Yes| D[Process Job - Status Stays Created]
C -->|No| E[Skip - Another Worker Processing]
D --> F[Status: PendingVerification]
D --> G[Release Lock]
subgraph "Automatic Recovery"
H[Lock TTL: 30 minutes]
H --> I[Worker Crashes]
I --> J[Lock Expires Automatically]
J --> K[Job Available for Retry]
end
end
style B fill:#90EE90
style H fill:#90EE90
style J fill:#90EE90
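The flow in the diagram could be sketched roughly as follows. This is only an illustration under stated assumptions: `TtlLockStore` is a hypothetical in-memory stand-in for the distributed cache, `process_job` and the `job:{id}` key format are made-up names, and the current time is passed in explicitly so that expiry can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical in-memory stand-in for a distributed cache lock with a TTL.
struct LockEntry {
    owner: String,
    expires_at: Instant,
}

#[derive(Default)]
struct TtlLockStore {
    locks: HashMap<String, LockEntry>,
}

impl TtlLockStore {
    /// Takes the lock unless another unexpired entry exists for `key`.
    fn try_acquire(&mut self, key: &str, owner: &str, ttl: Duration, now: Instant) -> bool {
        let held = self.locks.get(key).map_or(false, |e| e.expires_at > now);
        if held {
            return false;
        }
        let entry = LockEntry { owner: owner.to_string(), expires_at: now + ttl };
        self.locks.insert(key.to_string(), entry);
        true
    }

    /// Releases the lock only if `owner` is the current holder.
    fn release(&mut self, key: &str, owner: &str) -> bool {
        let owned = self.locks.get(key).map_or(false, |e| e.owner == owner);
        if owned {
            self.locks.remove(key);
        }
        owned
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobStatus {
    Created,
    PendingVerification,
}

struct Job {
    id: u64,
    status: JobStatus,
}

/// One worker attempt. Returns true if this worker processed the job.
/// The job stays in `Created` while the lock is held; no lock status is stored on it.
fn process_job(
    store: &mut TtlLockStore,
    job: &mut Job,
    worker: &str,
    ttl: Duration,
    now: Instant,
) -> bool {
    let key = format!("job:{}", job.id);
    if job.status != JobStatus::Created || !store.try_acquire(&key, worker, ttl, now) {
        return false; // already processed, or another worker holds the lock
    }
    // Real processing would happen here.
    job.status = JobStatus::PendingVerification;
    store.release(&key, worker);
    true
}
```

In this sketch a crashed worker simply never calls `release`; once `now` reaches the entry's expiry, the next `try_acquire` succeeds and the job is retried. A real implementation would need an atomic set-if-absent operation in the cache, and the TTL would have to exceed the longest expected job runtime, otherwise two workers could process the same job concurrently.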
Are you willing to help with this request?
Yes!