[scheduler] fix scheduling behavior of batch job allocs #26961
Allocations of batch jobs have two specific behaviors documented: First, on node drain, the allocation is allowed to complete unless the deadline is reached, at which point the allocation is killed. The allocation is not replaced. Second, when using the `alloc stop` command, the allocation is stopped and then rescheduled according to its reschedule policy. This update removes the change introduced in dfa07e1 (#26025) that forced batch job allocations into a failed state when migrating. The reported issue it was attempting to resolve was itself incorrect behavior. The reconciler has been adjusted to handle batch job allocations as documented.
scheduler/reconciler/filters.go
  remaining = make(allocSet)
  for id, alloc := range set {
-   if !alloc.ServerTerminalStatus() {
+   if (alloc.Job.Type == structs.JobTypeBatch && !alloc.DesiredTransition.ShouldReschedule()) || !alloc.ServerTerminalStatus() {
We're keeping batch allocs if they're server-terminal and don't have desired-transition reschedule. Is this because of `nomad alloc stop`? I don't think those allocs are actually server-terminal until after they've already been through the scheduler once.
In any case, this weird conditional could definitely use a "why" comment.
Yes, that is correct: this is because of `alloc stop`. Without this addition to the conditional, when the future eval is run, no allocation will be placed, because any existing complete allocations will be counted toward the total. Filtering out those that are marked for rescheduling allows them to actually be placed when the eval is run.
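The conditional under discussion can be sketched in isolation. This is a minimal sketch with simplified stand-in types and field names, not Nomad's actual scheduler structs:

```go
package main

import "fmt"

// alloc is an illustrative stand-in for Nomad's allocation struct;
// the field names are simplifications, not the real API.
type alloc struct {
	isBatch          bool
	serverTerminal   bool
	shouldReschedule bool
}

// filterTerminal keeps allocs that are not server-terminal, plus batch
// allocs that are NOT marked for rescheduling. Batch allocs marked for
// rescheduling (e.g. via `nomad alloc stop`) are dropped, so the
// follow-up eval does not count them toward the job's total and can
// actually place a replacement.
func filterTerminal(set map[string]alloc) map[string]alloc {
	remaining := make(map[string]alloc)
	for id, a := range set {
		if (a.isBatch && !a.shouldReschedule) || !a.serverTerminal {
			remaining[id] = a
		}
	}
	return remaining
}

func main() {
	set := map[string]alloc{
		"running":  {isBatch: true, serverTerminal: false},
		"stopped":  {isBatch: true, serverTerminal: true, shouldReschedule: true},
		"complete": {isBatch: true, serverTerminal: true},
	}
	remaining := filterTerminal(set)
	_, kept := remaining["stopped"]
	fmt.Println(len(remaining), kept) // 2 false
}
```

In this sketch, the stopped-and-marked-for-reschedule batch alloc is the only one filtered out, which is the behavior the comment above describes.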
nomad/structs/structs.go
  if (a.DesiredStatus == AllocDesiredStatusStop && !a.LastRescheduleFailed()) ||
-   (a.ClientStatus != AllocClientStatusFailed && a.ClientStatus != AllocClientStatusLost) ||
+   (!isBatch && a.ClientStatus != AllocClientStatusFailed && a.ClientStatus != AllocClientStatusLost) ||
If I have a batch alloc that's complete, but not yet stopped on the server, this change will mean `NextRescheduleTime` potentially returns true for the eval where we process that update.
Adjusted this to check for rescheduled batch.
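The effect of the changed clause can be shown in isolation. This is a hedged sketch using stand-in names rather than Nomad's real constants and methods: for non-batch jobs an alloc must be failed or lost to be reschedule-eligible, while batch allocs are no longer gated on client status here, so a `complete` batch alloc (stopped via `alloc stop`) remains eligible.

```go
package main

import "fmt"

// eligibleByClientStatus mirrors only the changed clause of the guard.
// The function and status names are illustrative, not Nomad's API.
func eligibleByClientStatus(isBatch bool, clientStatus string) bool {
	// Non-batch allocs must be failed or lost to reschedule.
	if !isBatch && clientStatus != "failed" && clientStatus != "lost" {
		return false
	}
	// Batch allocs fall through: client status alone does not
	// disqualify them, so a "complete" batch alloc stays eligible.
	return true
}

func main() {
	fmt.Println(eligibleByClientStatus(false, "complete")) // false: service alloc completed normally
	fmt.Println(eligibleByClientStatus(true, "complete"))  // true: complete batch alloc may still reschedule
	fmt.Println(eligibleByClientStatus(false, "failed"))   // true: failed service alloc reschedules
}
```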
  as = as.filterByTerminal()
  desiredChanges := new(structs.DesiredUpdates)
  desiredChanges.Stop, allocsToStop = as.filterAndStopAll(a.clusterState)
  // TODO(spox): what is with allocsToStop here? not appended, only last set returned?
Yikes, that seems wrong
Yeah, this is just a note for me to investigate a bit and spin out a separate PR.
Description
Allocations of batch jobs have two specific behaviors documented:
First, on node drain, the allocation is allowed to complete unless the deadline is reached, at which point the allocation is killed. The allocation is not replaced.
Second, when using the `alloc stop` command, the allocation is stopped and then rescheduled according to its reschedule policy.
This update removes the change introduced in dfa07e1 (#26025)
that forced batch job allocations into a failed state when
migrating. The reported issue it was attempting to resolve was
itself incorrect behavior. The reconciler has been adjusted
to properly handle batch job allocations as documented.
An important addition to note: a new eval trigger reason.
This is added to provide better information to the user. It is shown and explained in the last examples below.
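The two documented behaviors summarized in the description can be written down as a small lookup. This is a hypothetical sketch that paraphrases the description; the event names and result fields are illustrative, not a real Nomad API:

```go
package main

import "fmt"

// outcome captures what should happen to a batch alloc after an event:
// the expected final client status and whether a replacement is placed.
type outcome struct {
	finalStatus string
	replaced    bool
}

// expectedBatchOutcome encodes the documented batch-job behaviors.
func expectedBatchOutcome(event string) outcome {
	switch event {
	case "node-drain":
		// Alloc may run to completion (killed only at the drain
		// deadline); it is never replaced on another node.
		return outcome{finalStatus: "complete", replaced: false}
	case "alloc-stop":
		// Alloc stops with a complete status and is rescheduled
		// according to the job's reschedule policy.
		return outcome{finalStatus: "complete", replaced: true}
	}
	return outcome{}
}

func main() {
	fmt.Println(expectedBatchOutcome("node-drain")) // {complete false}
	fmt.Println(expectedBatchOutcome("alloc-stop")) // {complete true}
}
```

The behavior on `main` shown below deviates from this table in both cases, which is what this changeset corrects.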
Testing & Reproduction steps
batch jobspec
Behavior on main
alloc stop command
This shows the behavior of the `alloc stop` command on a batch job allocation. The job is started and then a single allocation is stopped:
Here we can see the result of the `alloc stop` command: the allocation is stopped in a failed state and immediately replaced. The desired behavior is that the allocation should be stopped with a `complete` status and rescheduled based on the reschedule policy.
drain behavior
This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one second deadline:
The drain stops the two allocations on the node in a failed state, and immediately places two new allocations. For drains, the allocations should be stopped with a `complete` status and should not be replaced.
Behavior with this changeset
alloc stop command
Now the allocation is stopped, in a complete state, and a new allocation hasn't immediately replaced it. Instead, the allocation has been rescheduled based on the reschedule policy as expected from the documented behavior. Once the delayed evaluation is executed, the new allocation is placed.
drain behavior
This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one second deadline:
The drain stops the two allocations on the node in a completed state, and the allocations are not replaced. This matches the documented expected behavior.
New evaluation trigger reason
The current behavior of Nomad when rescheduling an allocation is to assume the allocation being replaced has failed. When stopping an allocation, this results in an `eval status` with the following:
The `TriggeredBy` value insinuates that the eval was triggered by the allocation failing, but it was actually triggered by the allocation being rescheduled due to the `alloc stop` command. To more correctly describe the reason, the `EvalTriggerAllocReschedule` constant was introduced and used in this situation, which gives the value `alloc-reschedule` as shown below:
Links
Fixes #26929
Contributor Checklist
- …changelog entry using the `make cl` command.
- …ensure regressions will be caught.
- …and job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- …backporting document.
- …in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- …within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.