scheduler: perform feasibility checks for system canaries before computing placements #26953
base: main
Conversation
Canaries for system jobs are placed on a `tg.update.canary` percentage of eligible nodes. Some of those nodes may not be feasible, and until now we removed infeasible nodes during placement computation. This meant that if the first eligible node we picked for a canary happened to be infeasible, the scheduler would halt the deployment.

The solution presented here simplifies canary deployments: initially, system jobs that use canary updates get allocations placed on all eligible nodes, but before we start computing actual placements, a method called `evictCanaries` is called (much like `evictAndPlace` is for honoring MaxParallel), which performs a feasibility check on each node, up to the number of required canaries per task group. Feasibility checks are expensive, but this way we only check all the nodes in the worst case (canary=100); otherwise we stop checking once we know we can place enough canaries.
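A minimal sketch of the early-exit loop described above, assuming hypothetical names (`pickCanaryNodes`, `feasible`, and the `Node` type are illustrative, not Nomad's actual scheduler API):

```go
package main

import "fmt"

// Node stands in for a Nomad client node.
type Node struct{ ID string }

// pickCanaryNodes walks the eligible nodes in order and keeps the first
// `required` feasible ones. In the worst case (canary=100%) every node is
// checked, but otherwise the loop exits as soon as enough canaries can be
// placed, which is the cost-saving behavior described above.
func pickCanaryNodes(nodes []*Node, required int, feasible func(*Node) bool) []*Node {
	picked := make([]*Node, 0, required)
	for _, n := range nodes {
		if len(picked) >= required {
			break // enough canaries; skip further expensive checks
		}
		if feasible(n) {
			picked = append(picked, n)
		}
	}
	return picked
}

func main() {
	nodes := []*Node{{"a"}, {"b"}, {"c"}, {"d"}}
	// Pretend node "a" is infeasible; the scheduler should skip it rather
	// than halt the deployment.
	feasible := func(n *Node) bool { return n.ID != "a" }
	for _, n := range pickCanaryNodes(nodes, 2, feasible) {
		fmt.Println("canary on node", n.ID) // b, c
	}
}
```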
```go
// we only know the total amount of placements once we filter out
// infeasible nodes, so for system jobs we do it backwards a bit: the
// "desired" total is the total we were able to place.
if s.deployment != nil {
	s.deployment.TaskGroups[tgName].DesiredTotal += 1
}
```
For system jobs I think we need to make sure we're working from a blank-slate `dstate` for each evaluation. Incrementing this here is adding on top of the desired total from previous evals.
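One hedged reading of this suggestion, with `DeploymentState` and `resetDesiredTotals` as hypothetical stand-ins for the real types:

```go
package main

// DeploymentState mirrors the fields used in the diff above; this is an
// illustrative sketch of the "blank slate" idea, not Nomad's code.
type DeploymentState struct {
	DesiredTotal  int
	HealthyAllocs int
}

// resetDesiredTotals zeroes the per-task-group counters at the start of an
// evaluation, so that the `DesiredTotal += 1` increments count only this
// eval's placements instead of stacking on top of previous evals.
func resetDesiredTotals(taskGroups map[string]*DeploymentState) {
	for _, dstate := range taskGroups {
		dstate.DesiredTotal = 0
	}
}

func main() {
	tgs := map[string]*DeploymentState{"web": {DesiredTotal: 5}}
	resetDesiredTotals(tgs)   // blank slate before computing placements
	tgs["web"].DesiredTotal++ // increments now reflect this eval only
}
```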
```go
// ensure everything is healthy
if dstate, ok := s.deployment.TaskGroups[groupName]; ok {
	if dstate.HealthyAllocs < dstate.DesiredTotal { // Make sure we have enough healthy allocs
		complete = false
	}
}
```
If we're resetting desired total in `computePlacements`, it won't be correctly set when we reach `isDeploymentComplete`, which is called before that.
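A toy sketch of the call-order hazard being pointed out (all names and values here are illustrative):

```go
package main

import "fmt"

// If DesiredTotal is only reset and recomputed inside computePlacements,
// a completeness check that runs first compares HealthyAllocs against a
// stale zero and passes spuriously.
func main() {
	desiredTotal, healthy := 0, 0 // freshly reset, placements not yet computed

	// isDeploymentComplete-style check, running before placements:
	complete := !(healthy < desiredTotal)
	fmt.Println("complete (before placements):", complete) // true, spuriously

	// computePlacements-style update, running afterwards:
	desiredTotal = 3
	fmt.Println("complete (after placements):", !(healthy < desiredTotal)) // false
}
```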