Skip to content

Set root capability only when user not set it #4354

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

houyuting
Copy link

@houyuting houyuting commented Jun 9, 2025

Fixes #4350

Result:

UT- when I use old code, I got:
image

UT- when I use new code, I got:
image

@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025
@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 9, 2025
@volcano-sh-bot volcano-sh-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 9, 2025
@houyuting houyuting marked this pull request as ready for review June 9, 2025 09:23
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2025
@@ -598,7 +598,9 @@ func (cp *capacityPlugin) buildHierarchicalQueueAttrs(ssn *framework.Session) bo

// init root queue realCapability/capability/deserved as cp.totalResource
rootQueueAttr := cp.queueOpts[api.QueueID(cp.rootQueue)]
rootQueueAttr.capability = cp.totalResource
if rootQueueAttr.capability == nil || rootQueueAttr.capability.IsEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think rootQueueAttr.capability.IsEmpty() is enough because the capability of root queue will be initialized in newQueueAttr, and it won't be nil

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion, done.

Copy link
Member

@JesseStutler JesseStutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please sqush to one commit, thanks :)

@houyuting houyuting requested a review from JesseStutler June 10, 2025 02:45
@JesseStutler
Copy link
Member

image
@houyuting Please squash to one commit :)

@JesseStutler
Copy link
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 10, 2025
@JesseStutler
Copy link
Member

/lgtm
Please also take a look @hwdef @lowang-bh @Monokaix I think it's reasonable when users want to pre-set the capability for the root queue

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 10, 2025
Comment on lines 604 to 605
rootQueueAttr.realCapability = cp.totalResource
rootQueueAttr.deserved = cp.totalResource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we handle realCapability and deseverd too? if realCapability and deseverd not empty and less than cp.totalResource may occurs error ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think realCapability and deseverd == cp.totalResource is suitable,because we need at least a field that identifies how many resources are actually available in the cluster and I haven't thought of any scenarios where these two fields need to be customized. Or do you have some ideas? We can discuss it together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think realCapability and deseverd == cp.totalResource is suitable,because we need at least a field that identifies how many resources are actually available in the cluster and I haven't thought of any scenarios where these two fields need to be customized. Or do you have some ideas? We can discuss it together.

yes, i think u r right, the realcapacity and deseverd will be caculated in checkHierarchicalQueue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems that if users previously set deserved fields of each sub queue, and then deserved field check may also fail when cluster resources reduced.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, done, please review again, thanks.

@Monokaix
Copy link
Member

That's a good catch, and @Xu-Wentao also has the same requirement, and I think that we can add root queue at helm chart, then user can set the spec of root queue when install volcano, and we can remove root queue creation operation in volcano scheduler.

@houyuting
Copy link
Author

@Monokaix Good idea and we can open another feature issue and track it. If you have time,please give me approve, thanks.

@JesseStutler
Copy link
Member

@Monokaix Good idea and we can open another feature issue and track it. If you have time,please give me approve, thanks.

OK @houyuting Would you like to raise this issue? Otherwise we may miss it :)

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 30, 2025
@houyuting houyuting requested a review from Monokaix July 1, 2025 02:44
@Monokaix
Copy link
Member

Monokaix commented Jul 1, 2025

Please sign off your commit with git commit -s.

@Monokaix
Copy link
Member

Monokaix commented Jul 1, 2025

The root queue will also be updated when close session, so it's not enough to just modify plugin.

@houyuting
Copy link
Author

The root queue will also be updated when close session, so it's not enough to just modify plugin.

got it

@houyuting houyuting force-pushed the fix_4350 branch 2 times, most recently from 93f4a7b to e9e096a Compare July 9, 2025 02:53
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2025
@volcano-sh-bot volcano-sh-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 11, 2025
@JesseStutler
Copy link
Member

image The UT failed, please fix it thanks

@JesseStutler
Copy link
Member

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly implements the intended change to only set the root queue's capability and deserved resources if they are not already defined by the user. The changes are logical and the new test cases are a good addition.

I've pointed out a couple of areas for improvement. One is a minor comment update for maintainability. The other is a more significant concern regarding the removal of logic that caps a queue's deserved resources by its realCapability in a hierarchical setup. This could lead to inconsistencies and scheduling issues, so it's worth a second look. Overall, great work on addressing the issue.

@@ -598,9 +599,13 @@ func (cp *capacityPlugin) buildHierarchicalQueueAttrs(ssn *framework.Session) bo

// init root queue realCapability/capability/deserved as cp.totalResource

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment on this line is now outdated due to the changes below. It states that capability and deserved are initialized as cp.totalResource, but this is now done conditionally. Please update the comment to reflect the new logic for better code maintainability.

Suggested change
// init root queue realCapability/capability/deserved as cp.totalResource
// init root queue: realCapability is set to total resource, and capability/deserved are also set if empty.

@@ -370,6 +370,7 @@ func Test_capacityPlugin_OnSessionOpenWithHierarchy(t *testing.T) {
pg1 := util.BuildPodGroup("pg1", "ns1", "q11", 1, nil, schedulingv1beta1.PodGroupInqueue)
// queue
root := buildQueueWithParents("root", "", nil, nil)
root1 := buildQueueWithParents("root", "", nil, api.BuildResourceList("16", "16Gi"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It seems like root1 is overriding root queue, but root1 is not used anywhere in the test. Is this intentional?

@@ -468,6 +469,8 @@
pg13 := util.BuildPodGroup("pg13", "ns1", "q91", 1, nil, schedulingv1beta1.PodGroupInqueue)
// pod
p13 := util.BuildPod("ns1", "p13", "", corev1.PodPending, api.BuildResourceList("2", "2Gi", []api.ScalarResource{{Name: "nvidia.com/gpu", Value: "1"}}...), "pg13", make(map[string]string), make(map[string]string))
// queue
queue10 := buildQueueWithParents("q10", "root", nil, api.BuildResourceList("10", "4Gi", []api.ScalarResource{}...))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It seems like queue10 is not used anywhere in the test. Is this intentional?

Copy link
Member

@JesseStutler JesseStutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my latest review comments and gemini's comments, thanks

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 21, 2025
go.sum Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think go.mod and go.sum should not change in your pr

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign k82cn
You can assign the PR to them by writing /assign @k82cn in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2025
@@ -774,8 +780,14 @@ func (cp *capacityPlugin) checkHierarchicalQueue(attr *queueAttr) error {
}

if attr.name == cp.rootQueue {
attr.guarantee = totalGuarantee
cp.totalGuarantee = totalGuarantee
if attr.guarantee.IsEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root queue is created by yourself or auto created?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto created

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then its guarantee and capability fields will be updated once created, so seems this can not take effect.

@JesseStutler
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

hierarychical queues validation can not pass although root queue capability greater than sub queue capability
5 participants