Report
When applying the autoscaling.keda.sh/paused-replicas annotation to a ScaledObject, there’s a race condition that can leave the system permanently inconsistent:
- KEDA marks the ScaledObject as paused (Paused=True) at the requested paused replica count.
- The underlying Deployment remains at its previous replica count (not scaled).
- The HPA and scale loop are both absent, so nothing corrects the state.
- Manual intervention is required to recover.
This happens intermittently and is timing-dependent when toggling the annotation on/off.
Expected Behavior
The Deployment scales to 0 replicas whenever autoscaling.keda.sh/paused-replicas: "0" is applied, and subsequent toggles reliably pause and resume without getting stuck.
Actual Behavior
Sometimes the Deployment stays at its original replica count after re-applying the pause annotation. KEDA reports Paused=True, but there is no HPA and no running scale loop, so reconciliation does not proceed.
Steps to Reproduce the Problem
- Create a ScaledObject with:
  cooldownPeriod: 300
  pollingInterval: 15
- Ensure the target Deployment is running with replicas > 0.
- Apply the pause annotation:
  kubectl annotate scaledobject <name> autoscaling.keda.sh/paused-replicas="0" --overwrite
  → The first pause works; the Deployment scales down as expected.
- Remove the annotation and wait until the Deployment scales back up and is running normally.
- Re-apply the same annotation:
  kubectl annotate scaledobject <name> autoscaling.keda.sh/paused-replicas="0" --overwrite
- Observe: from the second pause onward, scaling intermittently fails; the Deployment remains at its prior replica count and does not recover automatically.
Logs from KEDA operator
Timeline - Failed Case:
15:30:12.609 - Reconcile #1:
├─ Gets ScaledObject (Paused condition = False)
├─ Enters pause block (scaledToPausedCount = true)
├─ Stops scale loop
├─ Deletes HPA → Triggers Reconcile #2
├─ Sets Paused=True (in memory)
└─ Returns, status write begins (slow)
15:30:12.644 - Reconcile #2 (35ms later):
├─ Gets ScaledObject (Paused condition STILL False!)
├─ Status write from #1 not persisted yet
├─ Enters pause block again (scaledToPausedCount = true)
├─ Tries to stop already-stopped loop
├─ Log: "ScalableObject was not found in controller cache"
├─ Returns early
└─ NO HPA created, NO scale loop started
[Stuck permanently - no more reconciles]
Timeline - Success Case:
15:19:58.114 - Reconcile #1:
├─ Same as above
└─ Status write begins
15:19:58.166 - Reconcile #2 (52ms later):
├─ Gets ScaledObject (Paused condition = True) ✅
├─ Status write completed!
├─ checkIfTargetResourceReachPausedCount() → false
├─ Falls through to normal reconcile
├─ Creates NEW HPA ✅
├─ Starts NEW scale loop ✅
└─ Scale loop scales to 0 ✅
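Whether the second reconcile recovers hinges entirely on whether the status write from Reconcile #1 has been persisted by the time Reconcile #2 reads the Paused condition. The following is a minimal, self-contained Go sketch (a toy model, not KEDA code; the 40 ms write latency is an assumed illustrative value, and the 35 ms / 52 ms gaps are taken from the timestamps above) that reproduces the timing dependence:

package main

import (
	"fmt"
	"sync"
	"time"
)

// pausedStatus models the Paused condition as observed by the next reconcile.
type pausedStatus struct {
	mu     sync.Mutex
	paused bool
}

// write models Reconcile #1's slow status update landing after some latency.
func (p *pausedStatus) write(latency time.Duration) {
	time.Sleep(latency)
	p.mu.Lock()
	p.paused = true
	p.mu.Unlock()
}

func (p *pausedStatus) read() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.paused
}

// secondReconcile models the branch Reconcile #2 takes based on what it sees.
func secondReconcile(p *pausedStatus) string {
	if p.read() {
		return "sees Paused=True, falls through, recreates HPA and scale loop"
	}
	return "sees stale Paused=False, re-enters pause block, finds nothing to stop, gets stuck"
}

func main() {
	const statusWriteLatency = 40 * time.Millisecond // assumed latency of Reconcile #1's status write
	for _, gap := range []time.Duration{35 * time.Millisecond, 52 * time.Millisecond} {
		status := &pausedStatus{}
		go status.write(statusWriteLatency) // status write from Reconcile #1, still in flight
		time.Sleep(gap)                     // Reconcile #2 fires this long after the HPA deletion
		fmt.Printf("reconcile #2 after %v: %s\n", gap, secondReconcile(status))
	}
}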
For details, please refer to the attached logs:
SuccessCasse-Logs-2025-11-02 17_32_48.txt
FailedCase-Logs-2025-11-02 17_30_57.txt
KEDA Version
2.18.0
Kubernetes Version
1.31
Platform
Other
Scaler Details
prometheus
Anything else?
The issue is in controllers/keda/scaledobject_controller.go, lines 243-246:
case needsToPause:
	scaledToPausedCount := true // ← Dangerous default
	if conditions.GetPausedCondition().Status == metav1.ConditionTrue {
		// Only checks deployment state if the condition is already True
		scaledToPausedCount = r.checkIfTargetResourceReachPausedCount(...)
		if scaledToPausedCount {
			return // Already done
		}
	}
	if scaledToPausedCount {
		// Enters this block in BOTH reconciles during the race
		stopScaleLoop()
		deleteHPA()
		conditions.SetPausedCondition(metav1.ConditionTrue, ...)
		return
	}

The problem: when Paused=False because of the race, scaledToPausedCount stays true and the code incorrectly enters the stop block again.
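Following that analysis, one possible direction (a sketch only, untested, and reusing the function names from the snippet above, whose real signatures may differ) is to derive scaledToPausedCount from the live target in both situations instead of defaulting it to true, so a reconcile that races the status write falls through to the normal path and recreates the HPA and scale loop:

case needsToPause:
	// Ask the target unconditionally instead of assuming it is already
	// at the paused count whenever the Paused condition is not yet True.
	scaledToPausedCount := r.checkIfTargetResourceReachPausedCount(...)
	if scaledToPausedCount {
		// Target is already at the paused replica count: safe to stop the
		// scale loop, remove the HPA, and record Paused=True.
		stopScaleLoop()
		deleteHPA()
		conditions.SetPausedCondition(metav1.ConditionTrue, ...)
		return
	}
	// Target has not reached the paused count yet: fall through to the
	// normal reconcile so an HPA and scale loop exist to drive it there.

An alternative with a similar effect would be to requeue the request until the Paused condition write is observed, but checking the live Deployment keeps the decision tied to the actual workload state rather than to status-write timing.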