-
Notifications
You must be signed in to change notification settings - Fork 735
scheduler: stop the running job when it has been balanced #9479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: 童剑 <[email protected]>
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: 童剑 <[email protected]>
c1709f1
to
70b3064
Compare
Signed-off-by: 童剑 <[email protected]>
f310e4f
to
cc49004
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a mechanism to stop a running balance-range job once the specified key ranges are deemed balanced.
- Introduce
defaultBalancedThresholdRatio
andisBalanced()
logic to detect balanced ranges. - Cache the computed plan in the scheduler to avoid redundant prepares and stop scheduling when balanced.
- Add new Prometheus metrics (
scheduled
,prepare-failed
) and update unit tests for the balance check.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
pkg/schedule/schedulers/metrics.go | Added balanceRangeScheduledCounter and balanceRangePrepareFailedCounter . |
pkg/schedule/schedulers/balance_range.go | Introduced balanced-threshold logic, cached plan in scheduler, updated IsScheduleAllowed /Schedule . |
pkg/schedule/schedulers/balance_range_test.go | Expanded test range and added assertions for the new isBalanced behavior. |
Comments suppressed due to low confidence (1)
pkg/schedule/schedulers/balance_range.go:415
- Storing
plan
on the scheduler struct can introduce data races ifIsScheduleAllowed
andSchedule
run concurrently. Consider passing the plan through method arguments or protecting access with a mutex.
s.plan = plan
maxScore := scoreMap[sources[0].GetID()] | ||
minScore := scoreMap[sources[len(sources)-1].GetID()] | ||
balancedThreshold := int64(float64(averageScore) * defaultBalancedThresholdRatio) | ||
if balancedThreshold < 2 { | ||
balancedThreshold = 2 | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calculating maxScore
and minScore
from sources
before sorting can yield incorrect values. Move the max/min computation after sorting or compute them by scanning scoreMap
directly.
maxScore := scoreMap[sources[0].GetID()] | |
minScore := scoreMap[sources[len(sources)-1].GetID()] | |
balancedThreshold := int64(float64(averageScore) * defaultBalancedThresholdRatio) | |
if balancedThreshold < 2 { | |
balancedThreshold = 2 | |
} |
Copilot uses AI. Check for mistakes.
// todo: don't prepare every times, the prepare information can be reused. | ||
plan, err := s.prepare(cluster, opInfluence, job) | ||
if err != nil { | ||
log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When prepare fails in IsScheduleAllowed
, the new balanceRangePrepareFailedCounter
metric should be incremented to track these failures consistently.
log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err)) | |
log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err)) | |
balanceRangePrepareFailedCounter.Inc() |
Copilot uses AI. Check for mistakes.
What problem does this PR solve?
Issue Number: Close #9484
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Release note