Description
What steps did you take and what happened:
We observed multiple DataUpload tasks getting stuck in Prepared phase for extended periods (18+ hours) without progressing to InProgress phase. The tasks eventually get canceled after the ItemOperationTimeout (default 24 hours) expires.
Observed behavior:
- DataUpload tasks transition from Accepted → Prepared phase successfully
- Tasks remain in Prepared phase for many hours (observed 18+ hours)
- No active data path operations are running (concurrency limit not reached)
- Tasks eventually get canceled after ItemOperationTimeout expires
- Restarting node-agent pods resolves the issue temporarily
What did you expect to happen:
Tasks in Prepared phase should progress to InProgress phase when:
- The exposing pod becomes ready
- A slot becomes available in the data path manager (concurrency limit allows)
- The task receives a reconcile call to check these conditions
Root Cause Analysis:
After investigation, we identified that tasks in Prepared phase are not included in the periodic enqueue source used by the DataUpload controller. The periodic enqueue only includes tasks in Accepted phase (see SetupWithManager in data_upload_controller.go).
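For context, the predicate attached to that periodic enqueue source has roughly the following shape. This is a minimal sketch, not the exact upstream code: the Velero v2alpha1 type and phase constant are assumed to match the current API, and the function name acceptedOnlyPredicate is illustrative.

```go
// Sketch of the phase filter on the DataUpload periodic enqueue source.
// Only Accepted tasks pass, so Prepared tasks never get the periodic ticks.
package sketch

import (
	velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// acceptedOnlyPredicate approximates the filter wired up in SetupWithManager.
func acceptedOnlyPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		du, ok := obj.(*velerov2alpha1api.DataUpload)
		if !ok {
			return false
		}
		// Prepared (and later) phases are filtered out, so those tasks rely
		// entirely on watch events or a previously scheduled requeue.
		return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted
	})
}
```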
Current reconcile triggers for Prepared phase:
- Watch events on DataUpload CR - when the CR is updated/changed
- Watch events on Pod - when the exposing pod status changes (via findDataUploadForPod)
- Requeue after errors - when ConcurrentLimitExceed occurs, the reconciler returns Requeue: true, RequeueAfter: 5s
The Problem:
When a task is in Prepared phase and:
- No DataUpload CR updates occur (no watch events)
- No Pod status changes occur (pod is already Running, no further updates)
- The task hit ConcurrentLimitExceed and was requeued, but the requeue mechanism may not work reliably if the reconciler wasn't called initially
The task will not receive any reconcile calls until one of the watch events occurs. This can lead to tasks being stuck indefinitely.
Edge Case Scenario:
- Multiple DataUpload tasks transition to Prepared phase
- They all try to acquire a slot in dataPathMgr (which has a default concurrency limit of 1)
- One task succeeds; the others hit ConcurrentLimitExceed and return Requeue: true, RequeueAfter: 5s (see the sketch after this list)
- However, if the reconciler wasn't called initially (no watch events), the requeue mechanism doesn't help
- Tasks remain in Prepared phase waiting for:
  - A slot to become available (when the active task completes)
  - A reconcile call to check if a slot is available
- Without periodic reconcile, tasks only get reconcile calls if:
  - The DataUpload CR is updated (unlikely if nothing changes it)
  - The exposing Pod status changes (unlikely if the pod is already Running)
- Tasks get stuck until node-agent restart (which triggers reconcile for all resources via watch initialization)
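To make the edge case concrete, here is a simplified sketch of that requeue path. It is illustrative only: errConcurrentLimitExceed and tryCreateDataPath are placeholder names, not the controller's actual identifiers.

```go
// Sketch of the "concurrent limit exceeded" branch of the Prepared-phase reconcile.
package sketch

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Placeholder for the error the data path manager returns when all slots are taken.
var errConcurrentLimitExceed = errors.New("concurrent limit exceeded")

func reconcilePrepared(tryCreateDataPath func() error) (ctrl.Result, error) {
	if err := tryCreateDataPath(); err != nil {
		if errors.Is(err, errConcurrentLimitExceed) {
			// Ask controller-runtime to requeue the same request after 5 seconds.
			// This only helps if this Reconcile call actually ran; a Prepared task
			// that never receives a watch event never reaches this branch at all.
			return ctrl.Result{Requeue: true, RequeueAfter: 5 * time.Second}, nil
		}
		return ctrl.Result{}, err
	}
	// A data path slot was acquired; the task can move on to InProgress.
	return ctrl.Result{}, nil
}
```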
Evidence:
- Tasks observed stuck in Prepared phase for 18+ hours
- No active data path operations running (concurrency limit not the issue)
- Restarting node-agent resolves the issue (confirms it's a reconcile call issue)
- Logs show ConcurrentLimitExceed errors at DEBUG level (making them invisible in normal logs)
- The periodic enqueue source predicate only includes the Accepted phase, not the Prepared phase
Proposed Solution:
Include Prepared phase tasks in the periodic enqueue source predicate to ensure they receive regular reconcile calls (every minute via preparingMonitorFrequency); a sketch of the change follows the list below. This would guarantee that tasks in Prepared phase get periodic opportunities to:
- Check if a slot is available in dataPathMgr
- Retry operations that were previously blocked
- Progress to InProgress phase when conditions are met
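A minimal sketch of what the proposed predicate change could look like, under the same assumptions as the earlier sketch (the function name is illustrative; the equivalent adjustment would be needed in the DataDownload controller, as noted under Additional Context):

```go
// Sketch of the proposed change: admit both Accepted and Prepared DataUploads
// into the periodic enqueue source so Prepared tasks are reconciled on every
// tick of preparingMonitorFrequency as well.
package sketch

import (
	velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func acceptedOrPreparedPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		du, ok := obj.(*velerov2alpha1api.DataUpload)
		if !ok {
			return false
		}
		// Including Prepared guarantees a periodic chance to re-check dataPathMgr
		// for a free slot even when no watch event ever arrives.
		return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted ||
			du.Status.Phase == velerov2alpha1api.DataUploadPhasePrepared
	})
}
```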
The following information will help us better understand what's going on:
Environment:
- Velero version: v1.17.0
- Kubernetes version: v1.32.4
- Number of DataUpload tasks in cluster: 465 DataUpload tasks observed (212 Completed, 252 Canceled, 1 Failed). Many of the canceled tasks were stuck in Prepared phase for ~18 hours before being canceled.
- Node-agent configuration: Default concurrency limit (1)
Velero Configuration:
deployNodeAgent: true
configuration:
  backupStorageLocation: null
  volumeSnapshotLocation: null
  namespace: cozy-velero
  features: EnableCSI
  defaultItemOperationTimeout: 24h
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.12.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
Logs:
When the problem occurs, you should see:
- DataUpload tasks in Prepared phase for extended periods
- DEBUG level logs showing ConcurrentLimitExceed errors (if log level is set to DEBUG)
- No active data path operations despite tasks waiting
Example log pattern:
level=info msg="Data upload is prepared and should be processed by node-1 (node-1)"
level=debug msg="Data path instance is concurrent limited requeue later"
Steps to Reproduce:
- Create multiple backup operations that generate DataUpload tasks
- Ensure the data path manager concurrency limit is low (default is 1)
- Wait for multiple tasks to reach Prepared phase
- Observe that tasks remain in Prepared phase without progressing
- Check that no watch events occur (no CR updates, no Pod status changes)
- Tasks will remain stuck until node-agent restart or timeout
Workaround:
Restarting node-agent pods forces a reconcile for all DataUpload resources, allowing stuck tasks to progress. However, this is not a permanent solution.
Additional Context:
This issue affects both DataUpload and DataDownload controllers, as they share the same pattern of periodic enqueue configuration. The fix should be applied to both controllers.
Related:
- PR [AI-generated] Fix race condition: check concurrent limit before GetExposed #9447 - Initial fix attempt (slot reservation mechanism)
- PR [AI-generated] Include Prepared phase tasks in periodic enqueue to prevent stalling #9449 - Alternative fix (including Prepared in periodic enqueue)