scheduler: stop the running job when it has been balanced #9479

Open
wants to merge 4 commits into
base: master
from

Conversation

bufferflies
Contributor

@bufferflies bufferflies commented Jul 3, 2025

What problem does this PR solve?

Issue Number: Close #9484

What is changed and how does it work?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

Related changes

Release note

None.

Contributor

ti-chi-bot bot commented Jul 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-linked-issue, release-note-none (denotes a PR that doesn't merit a release note), do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress), and dco-signoff: yes (indicates the PR's author has signed the dco) labels Jul 3, 2025
Contributor

ti-chi-bot bot commented Jul 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign connor1996 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Jul 3, 2025
Signed-off-by: 童剑 <[email protected]>
@bufferflies bufferflies marked this pull request as ready for review July 3, 2025 09:41
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/needs-linked-issue and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels Jul 3, 2025
@bufferflies bufferflies changed the title from "scheduler: stop job when job has been balanced" to "scheduler: stop the running job when it has been balanced" Jul 3, 2025
Signed-off-by: 童剑 <[email protected]>

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds a mechanism to stop a running balance-range job once the specified key ranges are deemed balanced.

  • Introduce defaultBalancedThresholdRatio and isBalanced() logic to detect balanced ranges.
  • Cache the computed plan in the scheduler to avoid redundant prepares and stop scheduling when balanced.
  • Add new Prometheus metrics (scheduled, prepare-failed) and update unit tests for the balance check.
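The balance check described in the overview can be sketched as follows. This is a minimal illustration, not the PR's implementation: the ratio value 0.05 and the rule "the gap between the highest and lowest score must not exceed the threshold" are assumptions made for the example. Only the name defaultBalancedThresholdRatio and the floor of 2 come from the reviewed snippet.

```go
package main

import "fmt"

// defaultBalancedThresholdRatio is named in the PR overview.
// The value 0.05 is an assumption made for this sketch only.
const defaultBalancedThresholdRatio = 0.05

// isBalanced reports whether the gap between the highest and lowest score
// is within a tolerance derived from the average score.
// The comparison rule (max-min <= threshold) is an assumption.
func isBalanced(maxScore, minScore, averageScore int64) bool {
	threshold := int64(float64(averageScore) * defaultBalancedThresholdRatio)
	if threshold < 2 {
		threshold = 2 // floor of 2, as in the reviewed snippet
	}
	return maxScore-minScore <= threshold
}

func main() {
	fmt.Println(isBalanced(104, 100, 100)) // gap 4, threshold 5
	fmt.Println(isBalanced(110, 100, 100)) // gap 10, threshold 5
	fmt.Println(isBalanced(11, 10, 10))    // gap 1, threshold floored to 2
}
```

The floor keeps very small key ranges from being treated as unbalanced when the scores differ by a single region.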

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/schedule/schedulers/metrics.go | Added balanceRangeScheduledCounter and balanceRangePrepareFailedCounter. |
| pkg/schedule/schedulers/balance_range.go | Introduced balanced-threshold logic, cached plan in scheduler, updated IsScheduleAllowed/Schedule. |
| pkg/schedule/schedulers/balance_range_test.go | Expanded test range and added assertions for the new isBalanced behavior. |

Comments suppressed due to low confidence (1)

pkg/schedule/schedulers/balance_range.go:415

  • Storing plan on the scheduler struct can introduce data races if IsScheduleAllowed and Schedule run concurrently. Consider passing the plan through method arguments or protecting access with a mutex.
		s.plan = plan
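
One way to act on the mutex suggestion above is sketched here. The types planCache and plan and the methods store and take are hypothetical names invented for the illustration; the actual scheduler struct and its plan type are not shown in this review. The take-and-clear behaviour is also an assumption, chosen so that a prepared plan is consumed at most once.

```go
package main

import (
	"fmt"
	"sync"
)

// plan is a hypothetical placeholder for the scheduler's prepared plan.
type plan struct {
	jobID uint64
}

// planCache guards a cached plan with a mutex so that a writer
// (for example IsScheduleAllowed) and a reader (for example Schedule)
// can run concurrently without a data race.
type planCache struct {
	mu   sync.Mutex
	plan *plan
}

// store replaces the cached plan.
func (c *planCache) store(p *plan) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.plan = p
}

// take returns the cached plan and clears the cache.
func (c *planCache) take() *plan {
	c.mu.Lock()
	defer c.mu.Unlock()
	p := c.plan
	c.plan = nil
	return p
}

func main() {
	var c planCache
	var wg sync.WaitGroup
	for i := uint64(1); i <= 4; i++ {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			c.store(&plan{jobID: id})
		}(i)
	}
	wg.Wait()
	fmt.Println(c.take() != nil) // a plan was cached by one of the writers
	fmt.Println(c.take() == nil) // the cache is empty after the first take
}
```

Passing the plan through method arguments, the other option the comment mentions, avoids shared state entirely but would require changing the method signatures.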

Comment on lines +673 to 679
maxScore := scoreMap[sources[0].GetID()]
minScore := scoreMap[sources[len(sources)-1].GetID()]
balancedThreshold := int64(float64(averageScore) * defaultBalancedThresholdRatio)
if balancedThreshold < 2 {
	balancedThreshold = 2
}

Copilot AI Jul 4, 2025

Calculating maxScore and minScore from sources before sorting can yield incorrect values. Move the max/min computation after sorting or compute them by scanning scoreMap directly.

Suggested change
maxScore := scoreMap[sources[0].GetID()]
minScore := scoreMap[sources[len(sources)-1].GetID()]
balancedThreshold := int64(float64(averageScore) * defaultBalancedThresholdRatio)
if balancedThreshold < 2 {
	balancedThreshold = 2
}

Copilot uses AI. Check for mistakes.
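
The second option in the comment above, scanning scoreMap directly, could look roughly like this. It is a sketch under assumptions: the helper name minMaxScores is hypothetical, and the map is assumed to be keyed by a uint64 store ID with int64 scores, which the snippet suggests but does not state.

```go
package main

import "fmt"

// minMaxScores scans the score map once and returns the lowest and highest
// score. It does not depend on any ordering of the stores.
// ok is false when the map is empty.
func minMaxScores(scoreMap map[uint64]int64) (minScore, maxScore int64, ok bool) {
	first := true
	for _, score := range scoreMap {
		if first {
			minScore, maxScore, first = score, score, false
			continue
		}
		if score < minScore {
			minScore = score
		}
		if score > maxScore {
			maxScore = score
		}
	}
	return minScore, maxScore, !first
}

func main() {
	scores := map[uint64]int64{1: 40, 2: 55, 3: 48}
	lo, hi, ok := minMaxScores(scores)
	fmt.Println(lo, hi, ok) // prints "40 55 true"
}
```

A single pass over the map costs O(n) and stays correct wherever the sort happens, at the price of one extra loop.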

// todo: don't prepare every times, the prepare information can be reused.
plan, err := s.prepare(cluster, opInfluence, job)
if err != nil {
	log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err))
Copilot AI Jul 4, 2025

When prepare fails in IsScheduleAllowed, the new balanceRangePrepareFailedCounter metric should be incremented to track these failures consistently.

Suggested change
- log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err))
+ log.Warn("failed to prepare balance key range scheduler", errs.ZapError(err))
+ balanceRangePrepareFailedCounter.Inc()
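
The pattern being suggested, counting every failed prepare on the error path, is sketched below. So that the example runs with the standard library only, counter is a stand-in type and not the Prometheus client. The functions prepare and isScheduleAllowed are simplified hypothetical versions, and returning false after a failure is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// counter is a stand-in for a metrics counter (not the Prometheus client).
type counter struct{ n atomic.Int64 }

// Inc adds one to the counter.
func (c *counter) Inc() { c.n.Add(1) }

// Value returns the current count.
func (c *counter) Value() int64 { return c.n.Load() }

// prepareFailedCounter plays the role of balanceRangePrepareFailedCounter.
var prepareFailedCounter counter

// prepare is a hypothetical placeholder that fails on request.
func prepare(fail bool) error {
	if fail {
		return errors.New("prepare failed")
	}
	return nil
}

// isScheduleAllowed increments the failure counter whenever prepare fails,
// then refuses to schedule (the return value is an assumption).
func isScheduleAllowed(fail bool) bool {
	if err := prepare(fail); err != nil {
		prepareFailedCounter.Inc()
		return false
	}
	return true
}

func main() {
	fmt.Println(isScheduleAllowed(true), prepareFailedCounter.Value())
	fmt.Println(isScheduleAllowed(false), prepareFailedCounter.Value())
}
```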

Labels
  • dco-signoff: yes (indicates the PR's author has signed the dco)
  • release-note-none (denotes a PR that doesn't merit a release note)
  • size/L (denotes a PR that changes 100-499 lines, ignoring generated files)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler: finish job when the key range of the job range has been balanced
1 participant