Race condition in paused-replicas annotation causes ScaledObject to get stuck in inconsistent state #7231

@nusmql

Description

Report

When the autoscaling.keda.sh/paused-replicas annotation is applied to a ScaledObject, a race condition can leave the system permanently inconsistent:

  • KEDA marks the ScaledObject as Paused=True at the requested paused replica count.
  • The underlying Deployment remains at its previous replica count (not scaled).
  • The HPA and scale loop are both absent, so nothing corrects the state.
  • Manual intervention is required to recover.

The failure is intermittent and timing-dependent when the annotation is toggled on and off.

Expected Behavior

The Deployment scales to 0 replicas whenever autoscaling.keda.sh/paused-replicas: "0" is applied, and subsequent toggles reliably pause and resume without getting stuck.
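
For reference, this is the annotation as it would appear on the ScaledObject manifest (the value is the desired paused replica count):

metadata:
  annotations:
    autoscaling.keda.sh/paused-replicas: "0"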

Actual Behavior

Sometimes the Deployment stays at its original replica count after re-applying the pause annotation. KEDA reports Paused=True, but there is no HPA and no running scale loop, so reconciliation does not proceed.
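
One way to confirm the stuck state (<name>, <target>, and <ns> are placeholders; by default KEDA names the HPA keda-hpa-<name>):

# Paused condition on the ScaledObject (True in the stuck state)
kubectl get scaledobject <name> -n <ns> \
  -o jsonpath='{.status.conditions[?(@.type=="Paused")].status}'

# The KEDA-managed HPA is gone even though the target was never scaled down
kubectl get hpa keda-hpa-<name> -n <ns>

# The Deployment still reports its pre-pause replica count
kubectl get deployment <target> -n <ns> -o jsonpath='{.spec.replicas}'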

Steps to Reproduce the Problem

  1. Create a ScaledObject with:

     cooldownPeriod: 300
     pollingInterval: 15

  2. Ensure the target Deployment is running with replicas > 0.
  3. Apply the pause annotation:
     kubectl annotate scaledobject <name> autoscaling.keda.sh/paused-replicas="0" --overwrite
     The first pause works; the Deployment scales down as expected.
  4. Remove the annotation and wait until the Deployment scales back up and is running normally.
  5. Re-apply the same annotation:
     kubectl annotate scaledobject <name> autoscaling.keda.sh/paused-replicas="0" --overwrite
  6. Observe: from the second pause onward, scaling intermittently fails; the Deployment remains at its prior replica count and does not recover automatically. (A small script that automates this toggle loop is sketched below.)
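
A rough automation of steps 3-5, assuming shell access; <name> and <ns> are placeholders, and the sleep durations are arbitrary and should be tuned to the pollingInterval/cooldownPeriod above:

for i in $(seq 1 10); do
  # pause: pin the ScaledObject to 0 replicas
  kubectl annotate scaledobject <name> -n <ns> \
    autoscaling.keda.sh/paused-replicas="0" --overwrite
  sleep 60    # wait for the pause to take effect

  # resume: the trailing "-" removes the annotation
  kubectl annotate scaledobject <name> -n <ns> \
    autoscaling.keda.sh/paused-replicas-
  sleep 120   # wait for the Deployment to scale back up
done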

Logs from KEDA operator

Timeline - Failed Case:

15:30:12.609 - Reconcile #1:
├─ Gets ScaledObject (Paused condition = False)
├─ Enters pause block (scaledToPausedCount = true)
├─ Stops scale loop
├─ Deletes HPA → Triggers Reconcile #2
├─ Sets Paused=True (in memory)
└─ Returns, status write begins (slow)
15:30:12.644 - Reconcile #2 (35ms later):
├─ Gets ScaledObject (Paused condition STILL False!)
├─ Status write from #1 not persisted yet
├─ Enters pause block again (scaledToPausedCount = true)
├─ Tries to stop already-stopped loop
├─ Log: "ScalableObject was not found in controller cache"
├─ Returns early
└─ NO HPA created, NO scale loop started
[Stuck permanently - no more reconciles]

Timeline - Success Case:

15:19:58.114 - Reconcile #1:
├─ Same as above
└─ Status write begins
15:19:58.166 - Reconcile #2 (52ms later):
├─ Gets ScaledObject (Paused condition = True) ✅
├─ Status write completed!
├─ checkIfTargetResourceReachPausedCount() → false
├─ Falls through to normal reconcile
├─ Creates NEW HPA ✅
├─ Starts NEW scale loop ✅
└─ Scale loop scales to 0 ✅

For details, please refer to the attached logs:

SuccessCasse-Logs-2025-11-02 17_32_48.txt
FailedCase-Logs-2025-11-02 17_30_57.txt

KEDA Version

2.18.0

Kubernetes Version

1.31

Platform

Other

Scaler Details

prometheus

Anything else?

The issue is in controllers/keda/scaledobject_controller.go, lines 243-246:

case needsToPause:
    scaledToPausedCount := true  // ← Dangerous default
    if conditions.GetPausedCondition().Status == metav1.ConditionTrue {
        // Only checks deployment state if condition already True
        scaledToPausedCount = r.checkIfTargetResourceReachPausedCount(...)
        if scaledToPausedCount {
            return // Already done
        }
    }
    if scaledToPausedCount {
        // Enters this block in BOTH reconciles during race
        stopScaleLoop()
        deleteHPA()
        conditions.SetPausedCondition(metav1.ConditionTrue, ...)
        return
    }

The problem: when the second reconcile still sees Paused=False (the status write from the first reconcile has not been persisted yet), scaledToPausedCount keeps its default of true, so the stop block runs again: it tries to stop the already-stopped loop, returns early, and never recreates the HPA or the scale loop.
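
One possible direction for a fix, sketched against the simplified snippet above (helper names are taken from that snippet; this is not a vetted patch): drop the dangerous default and decide from the target's live replica count, which is what the success case effectively does anyway:

case needsToPause:
    // Decide from the target's actual replica count instead of
    // defaulting to true when the Paused condition is not yet visible.
    scaledToPausedCount := r.checkIfTargetResourceReachPausedCount(...)
    if !scaledToPausedCount {
        // Target has not reached the paused count yet: exit the switch
        // and fall through to the normal reconcile path, so an HPA and
        // a scale loop exist to drive the scale-down.
        break
    }
    // Target is at the paused count: now it is safe to tear down.
    stopScaleLoop()
    deleteHPA()
    conditions.SetPausedCondition(metav1.ConditionTrue, ...)
    return

An alternative would be to requeue the request until the Paused condition write from the previous reconcile becomes visible, but re-checking the live target state avoids depending on status persistence ordering altogether.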
