Description
What steps did you take and what happened:
We observed multiple DataUpload tasks getting stuck in Prepared phase for extended periods (18+ hours) without progressing to InProgress phase. The tasks eventually get canceled after the ItemOperationTimeout (default 24 hours) expires.
Observed behavior:
- DataUpload tasks transition from Accepted → Prepared phase successfully
- Tasks remain in Prepared phase for many hours (observed 18+ hours)
- No active data path operations are running (concurrency limit not reached)
- Tasks eventually get canceled after ItemOperationTimeout expires
- Restarting node-agent pods resolves the issue temporarily
What did you expect to happen:
Tasks in Prepared phase should progress to InProgress phase when:
- The exposing pod becomes ready
- A slot becomes available in the data path manager (concurrency limit allows)
- The task receives a reconcile call to check these conditions
Root Cause Analysis:
After investigation, we identified that tasks in Prepared phase are not included in the periodic enqueue source used by the DataUpload controller. The periodic enqueue only includes tasks in Accepted phase (see SetupWithManager in data_upload_controller.go).
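For context, the predicate attached to that periodic enqueue source has roughly the following shape. This is a minimal sketch, not the exact upstream code: the Velero v2alpha1 type and phase constant are assumed to match the current API, and the function name acceptedOnlyPredicate is illustrative.

```go
// Sketch of the phase filter on the DataUpload periodic enqueue source.
// Only Accepted tasks pass, so Prepared tasks never get the periodic ticks.
package sketch

import (
	velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// acceptedOnlyPredicate approximates the filter wired up in SetupWithManager.
func acceptedOnlyPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		du, ok := obj.(*velerov2alpha1api.DataUpload)
		if !ok {
			return false
		}
		// Prepared (and later) phases are filtered out, so those tasks rely
		// entirely on watch events or a previously scheduled requeue.
		return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted
	})
}
```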
Current reconcile triggers for Prepared phase:
- Watch events on DataUpload CR - when the CR is updated/changed
- Watch events on Pod - when the exposing pod status changes (via findDataUploadForPod)
- Requeue after errors - when ConcurrentLimitExceed occurs, the reconciler returns Requeue: true, RequeueAfter: 5s
The Problem:
When a task is in Prepared phase and:
- No DataUpload CR updates occur (no watch events)
- No Pod status changes occur (pod is already Running, no further updates)
- The task hit ConcurrentLimitExceed and was requeued, but the requeue mechanism may not work reliably if the reconciler wasn't called initially
The task will not receive any reconcile calls until one of the watch events occurs. This can lead to tasks being stuck indefinitely.
Edge Case Scenario:
- Multiple DataUpload tasks transition to Prepared phase
- They all try to acquire a slot in dataPathMgr (which has a default concurrency limit of 1)
- One task succeeds; the others hit ConcurrentLimitExceed and return Requeue: true, RequeueAfter: 5s (see the sketch after this list)
- However, if the reconciler wasn't called initially (no watch events), the requeue mechanism doesn't help
- Tasks remain in Prepared phase waiting for:
  - A slot to become available (when the active task completes)
  - A reconcile call to check if a slot is available
- Without periodic reconcile, tasks only get reconcile calls if:
  - The DataUpload CR is updated (unlikely if nothing changes it)
  - The exposing Pod status changes (unlikely if the pod is already Running)
- Tasks get stuck until node-agent restart (which triggers reconcile for all resources via watch initialization)
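To make the edge case concrete, here is a simplified sketch of that requeue path. It is illustrative only: errConcurrentLimitExceed and tryCreateDataPath are placeholder names, not the controller's actual identifiers.

```go
// Sketch of the "concurrent limit exceeded" branch of the Prepared-phase reconcile.
package sketch

import (
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Placeholder for the error the data path manager returns when all slots are taken.
var errConcurrentLimitExceed = errors.New("concurrent limit exceeded")

func reconcilePrepared(tryCreateDataPath func() error) (ctrl.Result, error) {
	if err := tryCreateDataPath(); err != nil {
		if errors.Is(err, errConcurrentLimitExceed) {
			// Ask controller-runtime to requeue the same request after 5 seconds.
			// This only helps if this Reconcile call actually ran; a Prepared task
			// that never receives a watch event never reaches this branch at all.
			return ctrl.Result{Requeue: true, RequeueAfter: 5 * time.Second}, nil
		}
		return ctrl.Result{}, err
	}
	// A data path slot was acquired; the task can move on to InProgress.
	return ctrl.Result{}, nil
}
```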
Evidence:
- Tasks observed stuck in Prepared phase for 18+ hours
- No active data path operations running (concurrency limit not the issue)
- Restarting node-agent resolves the issue (confirms it's a reconcile call issue)
- Logs show ConcurrentLimitExceed errors at DEBUG level (making them invisible in normal logs)
- The periodic enqueue source predicate only includes the Accepted phase, not the Prepared phase
Proposed Solution:
Include Prepared phase tasks in the periodic enqueue source predicate to ensure they receive regular reconcile calls (every minute via preparingMonitorFrequency); a sketch of the change follows the list below. This would guarantee that tasks in Prepared phase get periodic opportunities to:
- Check if a slot is available in dataPathMgr
- Retry operations that were previously blocked
- Progress to InProgress phase when conditions are met
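A minimal sketch of what the proposed predicate change could look like, under the same assumptions as the earlier sketch (the function name is illustrative; the equivalent adjustment would be needed in the DataDownload controller, as noted under Additional Context):

```go
// Sketch of the proposed change: admit both Accepted and Prepared DataUploads
// into the periodic enqueue source so Prepared tasks are reconciled on every
// tick of preparingMonitorFrequency as well.
package sketch

import (
	velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func acceptedOrPreparedPredicate() predicate.Predicate {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		du, ok := obj.(*velerov2alpha1api.DataUpload)
		if !ok {
			return false
		}
		// Including Prepared guarantees a periodic chance to re-check dataPathMgr
		// for a free slot even when no watch event ever arrives.
		return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted ||
			du.Status.Phase == velerov2alpha1api.DataUploadPhasePrepared
	})
}
```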
The following information will help us better understand what's going on:
Environment:
- Velero version: v1.17.0
- Kubernetes version: v1.32.4
- Number of DataUpload tasks in cluster: 465 DataUpload tasks observed (212 Completed, 252 Canceled, 1 Failed). Many of the canceled tasks were stuck in Prepared phase for ~18 hours before being canceled.
- Node-agent configuration: Default concurrency limit (1)
Velero Configuration:
deployNodeAgent: true
configuration:
  backupStorageLocation: null
  volumeSnapshotLocation: null
  namespace: cozy-velero
  features: EnableCSI
  defaultItemOperationTimeout: 24h
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.12.1
    imagePullPolicy: IfNotPresent
    volumeMounts:
      - mountPath: /target
        name: plugins
Logs:
When the problem occurs, you should see:
- DataUpload tasks in Prepared phase for extended periods
- DEBUG level logs showing ConcurrentLimitExceed errors (if log level is set to DEBUG)
- No active data path operations despite tasks waiting
Example log pattern:
level=info msg="Data upload is prepared and should be processed by node-1 (node-1)"
level=debug msg="Data path instance is concurrent limited requeue later"
Steps to Reproduce:
- Create multiple backup operations that generate DataUpload tasks
- Ensure the data path manager concurrency limit is low (default is 1)
- Wait for multiple tasks to reach Prepared phase
- Observe that tasks remain in Prepared phase without progressing
- Check that no watch events occur (no CR updates, no Pod status changes)
- Tasks will remain stuck until node-agent restart or timeout
Workaround:
Restarting node-agent pods forces a reconcile for all DataUpload resources, allowing stuck tasks to progress. However, this is not a permanent solution.
Additional Context:
This issue affects both DataUpload and DataDownload controllers, as they share the same pattern of periodic enqueue configuration. The fix should be applied to both controllers.
Related:
- PR [AI-generated] Fix race condition: check concurrent limit before GetExposed #9447 - Initial fix attempt (slot reservation mechanism)
- PR [AI-generated] Include Prepared phase tasks in periodic enqueue to prevent stalling #9449 - Alternative fix (including Prepared in periodic enqueue)