Conversation

@kvaps kvaps commented Dec 10, 2025

Fix race condition in data upload controller concurrent limit check

Summary

Fixed a race condition where multiple DataUpload tasks were wasting time on expensive GetExposed operations even when the concurrent limit was already reached. This caused all tasks to get ConcurrentLimitExceed errors after spending time on GetExposed, leading to inefficient processing and tasks stuck in Prepared state.

Changes

  1. Added CanAcceptNewTask() method to datapath.Manager (pkg/datapath/manager.go):

    • Lightweight check that verifies whether a new task can be accepted based on the concurrent limit
    • Thread-safe with mutex protection
    • Returns true if len(tracker) < cocurrentNum (see the sketch after this list)
  2. Added concurrent limit check BEFORE GetExposed in all controllers:

    • DataUploadReconciler (pkg/controller/data_upload_controller.go)
    • DataDownloadReconciler (pkg/controller/data_download_controller.go)
    • PodVolumeBackupReconciler (pkg/controller/pod_volume_backup_controller.go)
    • PodVolumeRestoreReconciler (pkg/controller/pod_volume_restore_controller.go)
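
For reference, a minimal sketch of what the CanAcceptNewTask() method from item 1 could look like, assuming it sits next to the existing tracker fields in pkg/datapath/manager.go; the field names (cocurrentNum, tracker, trackerLock) and the interpretation of the bool argument as a resume flag are assumptions, so the real implementation may differ:

// CanAcceptNewTask reports whether the manager can take on another task
// without exceeding the configured concurrent limit. When resume is true
// the limit check is skipped, mirroring how resumed tasks are treated in
// CreateMicroServiceBRWatcher.
func (m *Manager) CanAcceptNewTask(resume bool) bool {
    m.trackerLock.Lock()
    defer m.trackerLock.Unlock()

    if resume {
        return true
    }
    return len(m.tracker) < m.cocurrentNum
}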

Problem

Previously, the concurrent limit was checked AFTER executing the expensive GetExposed operation:

  1. Multiple tasks would check GetAsyncBR(taskName) → all get nil (their names not in tracker yet)
  2. All tasks would execute GetExposed (expensive, can take seconds)
  3. All tasks would reach CreateMicroServiceBRWatcher simultaneously
  4. Only the first task would succeed (limit = 1), others get ConcurrentLimitExceed
  5. Error was logged at DEBUG level, so not visible in normal logs
  6. Tasks would requeue and repeat the cycle

This meant tasks were wasting CPU and time on GetExposed even when the concurrent limit was already reached, causing inefficiency and delays.

Solution

The fix adds an early check for the concurrent limit BEFORE executing GetExposed:

// Check if we can accept a new task BEFORE doing expensive GetExposed operation
if !r.dataPathMgr.CanAcceptNewTask(false) {
    log.Debug("Data path concurrent limit reached, requeue later")
    return ctrl.Result{Requeue: true, RequeueAfter: time.Second * 5}, nil
}

Now tasks immediately requeue if the limit is reached, without wasting resources on GetExposed.

Benefits

  • Resource efficiency: Tasks don't waste time on GetExposed when limit is reached
  • Fairer queue: Tasks check limit earlier and requeue, freeing resources faster
  • Fewer race conditions: Limit check happens before expensive operations
  • Universal fix: Applied to all controllers sharing the common tracker

Testing

The change is thread-safe as CanAcceptNewTask uses the same mutex as other tracker operations. The logic matches the existing check in CreateMicroServiceBRWatcher but allows early exit before expensive operations.

Does your change fix a particular issue?

Yes, this fixes a race condition that causes DataUpload tasks to waste resources and get stuck when concurrent limit is reached.

@kvaps kvaps force-pushed the fix-race branch 2 times, most recently from 194a7d9 to 1d5f116 on December 10, 2025 18:58
Add CanAcceptNewTask() method to datapath.Manager to check if a new
task can be accepted before executing the expensive GetExposed operation.

This prevents multiple tasks from wasting time on GetExposed when
the concurrent limit is already reached, which was causing tasks
to get ConcurrentLimitExceed errors after spending time on GetExposed.

Applied to all controllers:
- DataUploadReconciler
- DataDownloadReconciler
- PodVolumeBackupReconciler
- PodVolumeRestoreReconciler

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@Lyndon-Li (Contributor) commented:

Have you tested it with a large number of tasks and proven the effect? I think the current changes won't work as expected.

As in the flow mentioned above:

  1. Multiple tasks would check GetAsyncBR(taskName) → all get nil (their names not in tracker yet)
  2. All tasks would execute GetExposed (expensive, can take seconds)

Even with the current change:

  1. Multiple tasks would check GetAsyncBR(taskName) → all get nil (their names not in tracker yet)
  2. All tasks call r.dataPathMgr.CanAcceptNewTask. Since GetExposed is time consuming and len(m.tracker) is increased ONLY AFTER GetExposed in CreateMicroServiceBRWatcher, most probably all tasks pass CanAcceptNewTask
  3. Then all tasks STILL call GetExposed

The previous commit added a CanAcceptNewTask() check before GetExposed,
but this still allowed a race condition where multiple tasks could
pass the check simultaneously before any of them completed GetExposed
and was added to the tracker.

Replace the check with an atomic ReserveSlot() mechanism that reserves
a slot in the tracker before GetExposed, ensuring only tasks with
successfully reserved slots proceed to expensive operations.

Changes:
- Add ReserveSlot() method to atomically reserve slots in tracker
- Add ReleaseReservation() to release slots on error paths
- Update CreateMicroServiceBRWatcher to handle reservations
- Update all controllers to use ReserveSlot before GetExposed

This ensures concurrent limit is enforced atomically - when
cocurrentNum=1, only one task can reserve a slot at a time,
completely preventing the race condition.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps kvaps (Author) commented Dec 11, 2025

@Lyndon-Li sorry, you were talking with my AI agent 😊 I provide my response below:


You're absolutely right! My initial fix didn't fully solve the race condition. The issue is that CanAcceptNewTask() and CreateMicroServiceBRWatcher() are separate non-atomic operations, and multiple tasks can pass the check simultaneously before any of them completes GetExposed and adds to the tracker.

I've now implemented a reservation system that atomically reserves slots in the tracker before GetExposed:

Changes

  1. Added ReserveSlot(taskName) method - Atomically reserves a slot by adding to a reservations map, which is counted along with the tracker for concurrency limits
  2. Added ReleaseReservation(taskName) method - Releases a reserved slot if GetExposed or subsequent operations fail
  3. Modified CreateMicroServiceBRWatcher - Automatically releases the reservation when adding the actual AsyncBR to tracker
  4. Updated all controllers - Now call ReserveSlot() BEFORE GetExposed, and ReleaseReservation() on any error path

How it works

  1. Task calls ReserveSlot(taskName) → atomically checks len(tracker) + len(reservations) < cocurrentNum and adds to reservations map
  2. Task proceeds with GetExposed() - only if reservation succeeded
  3. If GetExposed succeeds → CreateMicroServiceBRWatcher moves from reservations to tracker atomically
  4. If GetExposed fails → ReleaseReservation removes from reservations

This ensures that:

  • Only tasks with successfully reserved slots proceed to GetExposed
  • The concurrent limit check is atomic (protected by mutex)
  • Other tasks see len(tracker) + len(reservations) >= cocurrentNum immediately and requeue
  • No wasted CPU/time on GetExposed when limit is reached

The reservation is counted in the concurrency limit, so if cocurrentNum=1, only one task can reserve a slot at a time, preventing the race condition completely.
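
A minimal sketch of the reservation mechanism, assuming a reservations map alongside the existing tracker in datapath.Manager; the exact field names and signatures are assumptions, and the real code would need more error handling:

// ReserveSlot atomically reserves a slot for taskName if the combined
// number of tracked tasks and outstanding reservations is below the
// concurrent limit. It returns false when the limit is already reached.
func (m *Manager) ReserveSlot(taskName string) bool {
    m.trackerLock.Lock()
    defer m.trackerLock.Unlock()

    if len(m.tracker)+len(m.reservations) >= m.cocurrentNum {
        return false
    }
    m.reservations[taskName] = struct{}{}
    return true
}

// ReleaseReservation frees a previously reserved slot, e.g. when GetExposed
// or a later step fails before the AsyncBR is added to the tracker.
func (m *Manager) ReleaseReservation(taskName string) {
    m.trackerLock.Lock()
    defer m.trackerLock.Unlock()
    delete(m.reservations, taskName)
}

In the controllers, ReserveSlot would be called before GetExposed, and ReleaseReservation on every error path before the AsyncBR is registered in the tracker.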

Context

In my cluster, I have 465 DataUpload tasks total, with:

  • 212 Completed
  • 252 Canceled (many due to the race condition issue)
  • 1 Failed

The race condition was particularly problematic with cocurrentNum=1 and a large number of tasks, as all tasks would repeatedly try to execute GetExposed simultaneously, leading to wasted resources and tasks getting stuck in the Prepared state before eventually being canceled.

@Lyndon-Li Lyndon-Li (Contributor) commented Dec 11, 2025

I just thought about it further; there is no way that the problem you mentioned could happen, because there is only one reconciler thread, or in other words, the tasks are actually processed synchronously on each node.
The asynchrony comes from the asynchronous processing of InProgress tasks.

@kvaps kvaps (Author) commented Dec 11, 2025

@Lyndon-Li I just had this issue when all my DataUploads were in the Prepared state and there were no active InProgress jobs.
The Velero agents had been running successfully for 30 days and did nothing with them.

After I restarted them, they started accepting new jobs and some of the Prepared jobs became InProgress; later the whole queue finished successfully.

kvaps added a commit to kvaps/velero that referenced this pull request Dec 11, 2025
Tasks in Prepared phase were not included in periodic enqueue (only Accepted phase was included). This meant they only received reconcile calls through watch events, which could cause them to get stuck for long periods when waiting for available slots.

This change adds Prepared phase to the periodic enqueue predicate for both DataUpload and DataDownload controllers, ensuring they get regular reconcile calls (every minute via preparingMonitorFrequency).

This is an alternative proposed solution to the issue described in vmware-tanzu#9447

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
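
A rough sketch of what including the Prepared phase in the periodic enqueue predicate could look like for the DataUpload controller; the wiring into Velero's periodical enqueue source is omitted, and the exact import paths, placement, and variable name are assumptions:

package controller

import (
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/predicate"

    velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
)

// Hypothetical predicate: keep enqueuing DataUploads that are waiting in
// the Accepted or Prepared phase so they receive periodic reconcile calls.
var periodicEnqueuePredicate = predicate.NewPredicateFuncs(func(object client.Object) bool {
    du, ok := object.(*velerov2alpha1api.DataUpload)
    if !ok {
        return false
    }
    return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted ||
        du.Status.Phase == velerov2alpha1api.DataUploadPhasePrepared
})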
@Lyndon-Li Lyndon-Li (Contributor) commented Dec 12, 2025

@Lyndon-Li I just had this issue when all my DataUploads were in the Prepared state and there were no active InProgress jobs. The Velero agents had been running successfully for 30 days and did nothing with them.

After I restarted them, they started accepting new jobs and some of the Prepared jobs became InProgress; later the whole queue finished successfully.

This stuck problem looks different from the problem this PR is trying to solve --- CPU time waste.
For the stuck problem, please open an issue; see #9449 (comment).

For CPU time waste:

  1. We cannot periodically enqueue Prepared tasks; this makes the situation even worse
  2. GetExposed is not very time consuming; when the task is ready, GetExposed means only a few API server calls
  3. Tasks in the Prepared phase won't squeeze one another, as they are processed synchronously

@kvaps kvaps (Author) commented Dec 12, 2025

Sure, done #9453

@blackpiglet blackpiglet changed the title from "Fix race condition: check concurrent limit before GetExposed" to "[AI-generated] Fix race condition: check concurrent limit before GetExposed" on Dec 15, 2025