[AI-generated] Fix race condition: check concurrent limit before GetExposed #9447
base: main
Conversation
194a7d9 to 1d5f116
Add CanAcceptNewTask() method to datapath.Manager to check if a new task can be accepted before executing the expensive GetExposed operation. This prevents multiple tasks from wasting time on GetExposed when the concurrent limit is already reached, which was causing tasks to get ConcurrentLimitExceed errors after spending time on GetExposed.

Applied to all controllers:
- DataUploadReconciler
- DataDownloadReconciler
- PodVolumeBackupReconciler
- PodVolumeRestoreReconciler

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
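For illustration, a minimal sketch of what such a check could look like, assuming the manager guards a tracker map with a mutex and holds a cocurrentNum limit. The field names follow the PR description and are assumptions, not the actual Velero code:

```go
package datapath

import "sync"

// Manager tracks running data path tasks. The fields shown here (trackerLock,
// tracker, cocurrentNum) are assumptions based on the PR description.
type Manager struct {
	trackerLock  sync.Mutex
	tracker      map[string]struct{}
	cocurrentNum int
}

// CanAcceptNewTask reports whether a new task can be started, i.e. whether
// the number of tracked tasks is still below the configured concurrent limit.
// It takes the same lock that guards the tracker, so the read is consistent
// with concurrent additions by CreateMicroServiceBRWatcher.
func (m *Manager) CanAcceptNewTask() bool {
	m.trackerLock.Lock()
	defer m.trackerLock.Unlock()

	return len(m.tracker) < m.cocurrentNum
}
```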
Have you tested it with a large number of tasks to prove the effect? I think the current changes won't work as expected. As in the flow mentioned above:
Even with the current change:
The previous commit added a CanAcceptNewTask() check before GetExposed, but this still allowed a race condition where multiple tasks could pass the check simultaneously before any of them completed GetExposed and was added to the tracker. Replace the check with an atomic ReserveSlot() mechanism that reserves a slot in the tracker before GetExposed, ensuring only tasks with successfully reserved slots proceed to expensive operations.

Changes:
- Add ReserveSlot() method to atomically reserve slots in tracker
- Add ReleaseReservation() to release slots on error paths
- Update CreateMicroServiceBRWatcher to handle reservations
- Update all controllers to use ReserveSlot before GetExposed

This ensures the concurrent limit is enforced atomically: when cocurrentNum=1, only one task can reserve a slot at a time, completely preventing the race condition.

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
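A rough sketch of the reservation mechanism, extending the assumed Manager sketch above (the actual PR may distinguish reservations from live watchers; this only illustrates the atomic check-and-insert):

```go
package datapath

// (Manager type and fields as in the earlier sketch.)

// ReserveSlot atomically reserves a slot in the tracker for taskName before
// the expensive GetExposed call. The length check and the insert happen under
// one lock, so at most cocurrentNum tasks can hold a slot at any time.
func (m *Manager) ReserveSlot(taskName string) bool {
	m.trackerLock.Lock()
	defer m.trackerLock.Unlock()

	if _, ok := m.tracker[taskName]; ok {
		return true // this task already holds a slot
	}
	if len(m.tracker) >= m.cocurrentNum {
		return false
	}
	m.tracker[taskName] = struct{}{}
	return true
}

// ReleaseReservation gives a reserved slot back, e.g. when GetExposed fails
// and the task is requeued instead of started.
func (m *Manager) ReleaseReservation(taskName string) {
	m.trackerLock.Lock()
	defer m.trackerLock.Unlock()

	delete(m.tracker, taskName)
}
```

A controller would call ReserveSlot right before GetExposed and ReleaseReservation on every error path before requeueing, so a failed exposure never leaks a slot.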
@Lyndon-Li sorry, you're talking with my AI agent 😊 I provide the response below:
You're absolutely right! My initial fix didn't fully solve the race condition: multiple tasks could still pass the check before any of them was added to the tracker. I've now implemented a reservation system that atomically reserves slots in the tracker before GetExposed.
Changes
How it works
This ensures that:
The reservation is counted in the concurrency limit, so if cocurrentNum=1, only one task can reserve a slot at a time.
Context
In my cluster, I have 465 DataUpload tasks total, with:
The race condition was particularly problematic in this setup.
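For concreteness, a controller-side sketch of the flow described under "How it works" above; the helper, interfaces, and the 5-second requeue interval are hypothetical names used only for illustration:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// slotReserver and exposedGetter are minimal stand-ins for datapath.Manager
// and the real exposer; they exist only to keep this illustration compilable.
type slotReserver interface {
	ReserveSlot(taskName string) bool
	ReleaseReservation(taskName string)
}
type exposedGetter interface {
	GetExposed(ctx context.Context, name string) (any, error)
}

// reserveThenExpose shows the intended ordering: reserve a slot first, run
// GetExposed only if the reservation succeeded, and release the slot on every
// error path so a failed exposure never leaks it.
func reserveThenExpose(ctx context.Context, mgr slotReserver, ex exposedGetter, taskName string) (ctrl.Result, error) {
	if !mgr.ReserveSlot(taskName) {
		// Limit reached: requeue without paying for GetExposed.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}

	if _, err := ex.GetExposed(ctx, taskName); err != nil {
		mgr.ReleaseReservation(taskName)
		return ctrl.Result{}, err
	}
	// ...hand the exposed result to CreateMicroServiceBRWatcher, which then
	// owns the slot for the lifetime of the data path task.
	return ctrl.Result{}, nil
}
```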
I just thought about it further: there is no way the problem you mentioned could happen, because there is only one reconciler thread; in other words, the tasks are actually processed synchronously on each node.
@Lyndon-Li I just had the issue when all my datauploads were stuck in Prepared. After I restarted, they started accepting new jobs and some of them completed.
Tasks in Prepared phase were not included in periodic enqueue (only Accepted phase was included). This meant they only received reconcile calls through watch events, which could cause them to get stuck for long periods when waiting for available slots.

This change adds Prepared phase to the periodic enqueue predicate for both DataUpload and DataDownload controllers, ensuring they get regular reconcile calls (every minute via preparingMonitorFrequency).

This is an alternative proposed solution to the issue described in vmware-tanzu#9447

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
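A hypothetical predicate along these lines illustrates the change; the helper itself and its wiring into the periodical enqueue source are assumptions, while the constant names follow the velerov2alpha1 API as I understand it:

```go
package controller

import (
	velerov2alpha1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v2alpha1"
)

// shouldPeriodicallyEnqueue illustrates the proposed change: DataUploads in
// the Prepared phase are enqueued periodically as well, not only those in
// Accepted, so tasks waiting for a free slot get reconciled on a timer
// instead of only via watch events.
func shouldPeriodicallyEnqueue(du *velerov2alpha1api.DataUpload) bool {
	return du.Status.Phase == velerov2alpha1api.DataUploadPhaseAccepted ||
		du.Status.Phase == velerov2alpha1api.DataUploadPhasePrepared
}
```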
This stuck problem looks different from the problem the PR is trying to solve, namely CPU time waste. For CPU time waste:
Sure, done: #9453
Fix race condition in data upload controller concurrent limit check
Summary
Fixed a race condition where multiple DataUpload tasks were wasting time on expensive GetExposed operations even when the concurrent limit was already reached. This caused all tasks to get ConcurrentLimitExceed errors after spending time on GetExposed, leading to inefficient processing and tasks stuck in the Prepared state.
Changes
Added CanAcceptNewTask() method to datapath.Manager (pkg/datapath/manager.go): returns true if len(tracker) < cocurrentNum.
Added concurrent limit check BEFORE GetExposed in all controllers:
- DataUploadReconciler (pkg/controller/data_upload_controller.go)
- DataDownloadReconciler (pkg/controller/data_download_controller.go)
- PodVolumeBackupReconciler (pkg/controller/pod_volume_backup_controller.go)
- PodVolumeRestoreReconciler (pkg/controller/pod_volume_restore_controller.go)
Problem
Previously, the concurrent limit was checked AFTER executing the expensive GetExposed operation:
- Multiple tasks call GetAsyncBR(taskName) → all get nil (their names are not in the tracker yet)
- All of them run GetExposed (expensive, can take seconds)
- All of them call CreateMicroServiceBRWatcher simultaneously
- Tasks over the limit get ConcurrentLimitExceed
This meant tasks were wasting CPU and time on GetExposed even when the concurrent limit was already reached, causing inefficiency and delays.
Solution
The fix adds an early check for the concurrent limit BEFORE executing GetExposed; tasks now immediately requeue if the limit is reached, without wasting resources on GetExposed.
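A sketch of where the early check could sit in a reconciler; the helper, the small interfaces, and the 5-second requeue interval are illustrative assumptions, not the code from this PR:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// limitChecker and exposer are minimal stand-ins for datapath.Manager and the
// real exposer interface, kept small so the example is self-contained.
type limitChecker interface{ CanAcceptNewTask() bool }
type exposer interface {
	GetExposed(ctx context.Context, name string) (any, error)
}

// prepareForDataPath shows the shape of the fix: requeue immediately when the
// concurrent limit is reached, and only pay for GetExposed when a slot might
// be available.
func prepareForDataPath(ctx context.Context, mgr limitChecker, ex exposer, name string) (ctrl.Result, error) {
	if !mgr.CanAcceptNewTask() {
		// Limit reached: retry later without the expensive exposure work.
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}

	if _, err := ex.GetExposed(ctx, name); err != nil {
		return ctrl.Result{}, err
	}
	// ...continue to CreateMicroServiceBRWatcher with the exposed result.
	return ctrl.Result{}, nil
}
```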
Benefits
- No time wasted on GetExposed when the limit is reached
Testing
The change is thread-safe as CanAcceptNewTask uses the same mutex as other tracker operations. The logic matches the existing check in CreateMicroServiceBRWatcher but allows early exit before expensive operations.
Does your change fix a particular issue?
Yes, this fixes a race condition that causes DataUpload tasks to waste resources and get stuck when the concurrent limit is reached.
Please indicate you've done the following:
- Created a changelog file (make new-changelog) or comment /kind changelog-not-required on this PR.
- Updated the corresponding documentation in site/content/docs/main.