DataUpload tasks stuck in Prepared phase for extended periods #9453

@kvaps

Description

What steps did you take and what happened:

We observed multiple DataUpload tasks getting stuck in the Prepared phase for extended periods (18+ hours) without progressing to the InProgress phase. The tasks eventually get canceled after the ItemOperationTimeout (24 hours in our configuration) expires.

Observed behavior:

  1. DataUpload tasks transition from Accepted to Prepared phase successfully
  2. Tasks remain in Prepared phase for many hours (observed 18+ hours)
  3. No active data path operations are running (concurrency limit not reached)
  4. Tasks eventually get canceled after ItemOperationTimeout expires
  5. Restarting node-agent pods resolves the issue temporarily

What did you expect to happen:

Tasks in Prepared phase should progress to InProgress phase when:

  1. The exposing pod becomes ready
  2. A slot becomes available in the data path manager (concurrency limit allows)
  3. The task receives a reconcile call to check these conditions

Root Cause Analysis:

After investigation, we identified that tasks in Prepared phase are not included in the periodic enqueue source used by the DataUpload controller. The periodic enqueue only includes tasks in Accepted phase (see SetupWithManager in data_upload_controller.go).

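For illustration, here is a minimal sketch of the kind of phase filter the periodic enqueue source applies. The type and constant names (velerov2alpha1.DataUpload, DataUploadPhaseAccepted) follow the v2alpha1 API as I understand it, and the wiring into the actual periodical enqueue source in SetupWithManager is omitted, so treat this as an illustration of the filter logic rather than the real code:

package dataupload

import (
    velerov2alpha1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// currentPeriodicEnqueuePredicate sketches today's behavior: only DataUploads
// in the Accepted phase pass the filter, so Prepared tasks never receive a
// periodic reconcile from this source.
func currentPeriodicEnqueuePredicate() predicate.Predicate {
    return predicate.NewPredicateFuncs(func(obj client.Object) bool {
        du, ok := obj.(*velerov2alpha1.DataUpload)
        if !ok {
            return false
        }
        return du.Status.Phase == velerov2alpha1.DataUploadPhaseAccepted
    })
}
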
Current reconcile triggers for Prepared phase:

  1. Watch events on DataUpload CR - when the CR is updated/changed
  2. Watch events on Pod - when the exposing pod status changes (via findDataUploadForPod)
  3. Requeue after errors - when ConcurrentLimitExceed occurs, the reconciler returns Requeue: true, RequeueAfter: 5s (see the sketch after this list)

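The requeue in trigger 3 only exists as a return value from Reconcile, which is the crux of the problem described next. A hedged sketch of that branch (the real controller code is structured differently; this only shows the control flow):

package dataupload

import (
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// handleConcurrentLimit sketches trigger 3: when no data path slot is free,
// the reconciler asks to be called again in 5 seconds. Note that this Result
// is only produced if Reconcile was invoked at all; nothing schedules that
// initial call once the DataUpload CR and the exposing Pod stop emitting
// watch events.
func handleConcurrentLimit(slotAcquired bool) (ctrl.Result, error) {
    if !slotAcquired {
        return ctrl.Result{Requeue: true, RequeueAfter: 5 * time.Second}, nil
    }
    return ctrl.Result{}, nil
}
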
The Problem:

When a task is in Prepared phase and:

  • No DataUpload CR updates occur (no watch events)
  • No Pod status changes occur (pod is already Running, no further updates)
  • The task hit ConcurrentLimitExceed and asked to be requeued, but a requeue is only scheduled if the reconciler actually ran; if it was never called in the first place, there is nothing pending to retry

The task will not receive any reconcile calls until one of the watch events occurs. This can lead to tasks being stuck indefinitely.

Edge Case Scenario:

  1. Multiple DataUpload tasks transition to Prepared phase
  2. They all try to acquire a slot in dataPathMgr, which has a default concurrency limit of 1 (a sketch of this slot logic follows the list)
  3. One task succeeds, others hit ConcurrentLimitExceed and return Requeue: true, RequeueAfter: 5s
  4. However, for a task whose reconciler was never called in the first place (no watch events), no requeue is ever scheduled, so that retry mechanism does not help
  5. Tasks remain in Prepared phase waiting for:
    • A slot to become available (when the active task completes)
    • A reconcile call to check if a slot is available
  6. Without periodic reconcile, tasks only get reconcile calls if:
    • The DataUpload CR is updated (unlikely if nothing changes it)
    • The exposing Pod status changes (unlikely if pod is already Running)
  7. Tasks get stuck until node-agent restart (which triggers reconcile for all resources via watch initialization)

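A self-contained sketch of the slot-acquisition step in items 2-3. The real dataPathMgr in Velero has a different API; the names and the error value here are illustrative only:

package dataupload

import (
    "errors"
    "sync"
)

// errConcurrentLimitExceed models the "concurrent limited" condition the
// reconciler hits in step 3 (illustrative name, not Velero's).
var errConcurrentLimitExceed = errors.New("concurrent limit exceeded")

// dataPathManager models a manager that hands out at most `limit` concurrent
// data path slots; the default limit is 1.
type dataPathManager struct {
    mu     sync.Mutex
    limit  int
    active map[string]struct{}
}

func newDataPathManager(limit int) *dataPathManager {
    return &dataPathManager{limit: limit, active: map[string]struct{}{}}
}

// acquire returns errConcurrentLimitExceed when all slots are taken. A
// Prepared DataUpload that gets this error must be reconciled again later to
// retry; without a periodic enqueue, that later reconcile may never come.
func (m *dataPathManager) acquire(name string) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    if len(m.active) >= m.limit {
        return errConcurrentLimitExceed
    }
    m.active[name] = struct{}{}
    return nil
}

// release frees a slot when the active task completes, allowing the next
// reconcile of a waiting task to succeed, if that reconcile ever happens.
func (m *dataPathManager) release(name string) {
    m.mu.Lock()
    defer m.mu.Unlock()
    delete(m.active, name)
}
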
Evidence:

  • Tasks observed stuck in Prepared phase for 18+ hours
  • No active data path operations running (concurrency limit not the issue)
  • Restarting node-agent resolves the issue (confirms it's a reconcile call issue)
  • Logs show ConcurrentLimitExceed errors at DEBUG level (making them invisible in normal logs)
  • The periodic enqueue source predicate only includes Accepted phase, not Prepared phase

Proposed Solution:

Include Prepared phase tasks in the periodic enqueue source predicate so they receive regular reconcile calls (every minute via preparingMonitorFrequency); a sketch of the change follows the list below. This would guarantee that tasks in Prepared phase get periodic opportunities to:

  • Check if a slot is available in dataPathMgr
  • Retry operations that were previously blocked
  • Progress to InProgress phase when conditions are met

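As a sketch of what that change could look like, the filter from the earlier sketch would simply admit Prepared tasks as well (constant names again follow the v2alpha1 API as I understand it; the actual change would live in SetupWithManager of data_upload_controller.go and its DataDownload counterpart):

package dataupload

import (
    velerov2alpha1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

// proposedPeriodicEnqueuePredicate extends the current filter so Prepared
// tasks are also re-enqueued on every periodic tick, giving them a regular
// chance to acquire a data path slot and move to InProgress.
func proposedPeriodicEnqueuePredicate() predicate.Predicate {
    return predicate.NewPredicateFuncs(func(obj client.Object) bool {
        du, ok := obj.(*velerov2alpha1.DataUpload)
        if !ok {
            return false
        }
        switch du.Status.Phase {
        case velerov2alpha1.DataUploadPhaseAccepted, velerov2alpha1.DataUploadPhasePrepared:
            return true
        default:
            return false
        }
    })
}
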
The following information will help us better understand what's going on:

Environment:

  • Velero version: v1.17.0
  • Kubernetes version: v1.32.4
  • Number of DataUpload tasks in cluster: 465 DataUpload tasks observed (212 Completed, 252 Canceled, 1 Failed). Many of the canceled tasks were stuck in Prepared phase for ~18 hours before being canceled.
  • Node-agent configuration: Default concurrency limit (1)

Velero Configuration:

deployNodeAgent: true
configuration:
  backupStorageLocation: null
  volumeSnapshotLocation: null
  namespace: cozy-velero
  features: EnableCSI
  defaultItemOperationTimeout: 24h
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.12.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins

Logs:

When the problem occurs, you should see:

  • DataUpload tasks in Prepared phase for extended periods
  • DEBUG level logs showing ConcurrentLimitExceed errors (if log level is set to DEBUG)
  • No active data path operations despite tasks waiting

Example log pattern:

level=info msg="Data upload is prepared and should be processed by node-1 (node-1)"
level=debug msg="Data path instance is concurrent limited requeue later"

Steps to Reproduce:

  1. Create multiple backup operations that generate DataUpload tasks
  2. Ensure the data path manager concurrency limit is low (default is 1)
  3. Wait for multiple tasks to reach Prepared phase
  4. Observe that tasks remain in Prepared phase without progressing
  5. Check that no watch events occur (no CR updates, no Pod status changes)
  6. Tasks will remain stuck until node-agent restart or timeout

Workaround:

Restarting node-agent pods forces a reconcile for all DataUpload resources, allowing stuck tasks to progress. However, this is not a permanent solution.

Additional Context:

This issue affects both DataUpload and DataDownload controllers, as they share the same pattern of periodic enqueue configuration. The fix should be applied to both controllers.
