[KEP] Introduce MultiKueue Dispatcher API #5410
Conversation
Skipping CI for Draft Pull Request.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
We also need to discuss the granularity of the timeout, as mentioned by @mimowo.
In my opinion this is not a question of if, but of how we deliver those levels of timeout, because we already see at least two scenarios that require different levels. One could be more general: a single timeout for a large number of clusters of a similar type. The other should be more granular, probably at the worker level: different clusters, but not many of them.
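For illustration only, here is a minimal Go sketch of what two timeout levels could look like; the type and field names below are hypothetical and not part of the KEP:

```go
package config

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// DispatcherTimeouts is a hypothetical illustration of the two timeout
// levels discussed above; none of these names are agreed API.
type DispatcherTimeouts struct {
	// Default is the coarse-grained timeout shared by a large set of
	// similar worker clusters.
	Default *metav1.Duration `json:"default,omitempty"`

	// PerCluster holds finer-grained overrides keyed by worker cluster
	// name, for the case of a few heterogeneous clusters.
	PerCluster map[string]metav1.Duration `json:"perCluster,omitempty"`
}
```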
Let's start with a KEP update for this. /release-note-edit
Force-pushed from 33b16c8 to a042b4f.
LGTM. I'm not tagging yet, to give @tenzen-y a chance for more comments and to think more about the spec vs. status thread.
Co-authored-by: Michał Woźniak <[email protected]>
/lgtm
LGTM label has been added. Git tree hash: 7144aa04b32bba71948a58634c6b96be2f39395e
/assign
@vladikkuzn @mszadkow please address the remaining comment: #5410 (comment)
/lgtm
LGTM label has been added. Git tree hash: ba769afe78d955aebd28fd202bdb634bc6d27a2f
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, mszadkow The full list of commands accepted by this bot can be found here. The pull request process is described here
Thank you for proceeding.
and 3 additional clusters are nominated, until the workload is admitted or all eligible clusters have been considered.
This strategy allows for a controlled and gradual expansion of candidate clusters, rather than dispatching the workload to all clusters at once.
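For illustration, a minimal sketch of the incremental nomination logic described above; the helper name and its shape are assumptions for this example, and only the batch size of 3 comes from the KEP text:

```go
package main

import "fmt"

// nominateNextBatch sketches one round of incremental nomination: it appends
// up to batchSize eligible clusters that are not yet nominated.
func nominateNextBatch(nominated, eligible []string, batchSize int) []string {
	seen := make(map[string]bool, len(nominated))
	for _, c := range nominated {
		seen[c] = true
	}
	added := 0
	for _, c := range eligible {
		if added == batchSize {
			break
		}
		if !seen[c] {
			nominated = append(nominated, c)
			added++
		}
	}
	return nominated
}

func main() {
	eligible := []string{"cluster-a", "cluster-b", "cluster-c", "cluster-d", "cluster-e"}
	nominated := nominateNextBatch(nil, eligible, 3)
	fmt.Println(nominated) // [cluster-a cluster-b cluster-c]
	nominated = nominateNextBatch(nominated, eligible, 3)
	fmt.Println(nominated) // all five clusters
}
```

A real dispatcher would stop nominating once the workload is admitted and would have to decide, as discussed below, what to do once all eligible clusters have been considered.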
What happens to .spec.nominatedClusterNames after all eligible clusters have been considered?
Is the field reset to empty?
If yes, it would be better to mention it here.
I would give ownership of the field to the dispatcher, so it decides when to reset it, etc.
Here, the dispatcher is implemented by upstream Kueue, right?
IIUC, AllAtOnce and Incremental are implemented by upstream Kueue.
We can have both external and built-in dispatchers.
This description is for Incremental, so the dispatcher is the upstream one, IIUC.
My question is about the upstream Incremental dispatcher.
As I discussed with Marcin, in the future, hopefully in 0.14, we will add support for parametrizing the dispatcher. Then we could have a boolean flag indicating whether the dispatcher should auto-reject or not.
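To make the idea concrete, a purely hypothetical sketch of such a parameter; nothing below is an agreed API:

```go
package config

// DispatcherParameters sketches the "parametrizing the dispatcher" idea;
// the type and field names are assumptions for illustration only.
type DispatcherParameters struct {
	// AutoReject indicates whether the dispatcher should reject a workload
	// once all eligible clusters have been considered without admission.
	AutoReject bool `json:"autoReject,omitempty"`
}
```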
IIUC they do not get rejected; they stay nominated for admission from the point of view of this feature.
Do you want to set the rejection state after another round has expired?
Oh, I see. Thank you for the good call. Could you mention in this proposal what happens if the workload cannot be scheduled to any of the clusters with the Incremental dispatcher? Does it just record a cluster assignment error in the controller-manager logs?
Good question. I think this would be a sensible extension, but I wouldn't say it is necessary in the first iteration. Note that we don't reject the workload until 0.12 with the built-in dispatcher.
wdyt @tenzen-y?
I'm ok without Reject state for now.
Moved to status; I also checked that CEL validation is possible.
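For context, a hypothetical fragment of how a CEL rule could be attached to the field via kubebuilder markers; the rule and message are illustrative only, not the KEP's actual validation:

```go
package v1beta1

// WorkloadStatus is a trimmed, hypothetical fragment showing a CEL rule on
// the nominated cluster names; the rule below is an example, not the KEP's.
type WorkloadStatus struct {
	// +kubebuilder:validation:XValidation:rule="self.all(c, size(c) > 0 && size(c) <= 253)",message="cluster names must be 1-253 characters"
	// +listType=set
	// +optional
	NominatedClusterNames []string `json:"nominatedClusterNames,omitempty"`
}
```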
As I discussed with Marcin, in the future, hopefully in 0.14, we will add support for parametrizing the dispatcher. Then we could have a boolean flag indicating whether the dispatcher should auto-reject or not.
Could you describe what "parametrizing the dispatcher" means? What is the parameter here?
Moved to status; I also checked that CEL validation is possible.
I still think that nominatedClusterNames should be in spec. Please follow #5410 (comment).
* .status.clusterName; .spec.nominatedClusterNames
Force-pushed from f302ecf to d4b1a3f.
New changes are detected. LGTM label has been removed.
What type of PR is this?
/kind feature
What this PR does / why we need it:
The feature aims to improve performance and practicality by reducing the overhead of distributing workloads to all clusters simultaneously, minimizing the risk of duplicate admissions and unnecessary preemptions.
It should prevent triggering autoscaling across multiple worker clusters at the same time.
Which issue(s) this PR fixes:
Fixes #5141
Special notes for your reviewer:
Does this PR introduce a user-facing change?