
Conversation

chrisroberts (Member) commented Oct 18, 2025

Description

Allocations of batch jobs have two specific documented behaviors:

First, on node drain, the allocation is allowed to complete unless
the deadline is reached, at which point the allocation is killed. The
allocation is not replaced.

Second, when using the alloc stop command, the allocation is
stopped and then rescheduled according to its reschedule policy.

This update removes the change introduced in dfa07e1 (#26025)
that forced batch job allocations into a failed state when
migrating. The behavior described in the issue it was attempting
to resolve was itself incorrect. The reconciler has been adjusted
to properly handle batch job allocations as documented.

An important addition to note: a new eval trigger reason

  • EvalTriggerAllocReschedule

This is added to provide better information to the user. It is
shown and explained in the last examples below.

Testing & Reproduction steps

batch jobspec
job "sleep-job" {
  type = "batch"

  group "sleeper" {
    count = 5

    reschedule {
      attempts       = 3
      interval       = "15m"
      delay          = "4m"
      delay_function = "constant"
      max_delay      = "5m"
      unlimited      = false
    }

    ephemeral_disk {
      size = 10
    }

    task "do_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["1d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }

    task "extra_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["2d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }
  }
}

Behavior on main

alloc stop command

This shows the behavior of the alloc stop command on a batch job allocation. The job is started and then a single allocation is stopped:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:51:06-07:00: Monitoring evaluation "40250ff8"
    2025-10-17T17:51:06-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:07-07:00: Allocation "71d6882e" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "8e671f60" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "c72be233" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "ca3f8856" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "421b7a60" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:07-07:00: Evaluation "40250ff8" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
421b7a60  b0dccea3  sleeper     0        run      running  3s ago   2s ago
71d6882e  0e569f27  sleeper     0        run      running  3s ago   2s ago
8e671f60  0e569f27  sleeper     0        run      running  3s ago   2s ago
c72be233  b0dccea3  sleeper     0        run      running  3s ago   2s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 42
==> 2025-10-17T17:51:31-07:00: Monitoring evaluation "855d8b1a"
    2025-10-17T17:51:31-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:32-07:00: Allocation "8b1af122" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:32-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:32-07:00: Evaluation "855d8b1a" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        1       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
8b1af122  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
421b7a60  b0dccea3  sleeper     0        stop     failed   29s ago  3s ago
71d6882e  0e569f27  sleeper     0        run      running  29s ago  28s ago
8e671f60  0e569f27  sleeper     0        run      running  29s ago  28s ago
c72be233  b0dccea3  sleeper     0        run      running  29s ago  28s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  29s ago  28s ago

Here we can see that the alloc stop command leaves the allocation stopped in a failed state, and the allocation is immediately replaced. The desired behavior is that the allocation is stopped with a complete status and rescheduled according to its reschedule policy.
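For context, the server records this kind of operator intent on the allocation itself via DesiredTransition, which is what the reconciler change later in this PR inspects. A simplified sketch of the relevant structure (field and method names follow nomad/structs, trimmed down here; this is not the full definition):

package structs // simplified sketch, not the full Nomad definition

// DesiredTransition captures the operator's intent for an allocation.
type DesiredTransition struct {
	Migrate    *bool // set when a node drain wants the alloc moved
	Reschedule *bool // set when a reschedule of the alloc is requested
}

// ShouldReschedule reports whether a reschedule was requested.
func (d DesiredTransition) ShouldReschedule() bool {
	return d.Reschedule != nil && *d.Reschedule
}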

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one second deadline:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:58:19-07:00: Monitoring evaluation "28b04ae3"
    2025-10-17T17:58:19-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:58:20-07:00: Allocation "8841e305" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "de029dc7" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "f33973b8" created: node "0e569f27", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "2d9fb037" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "733eb34d" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:58:20-07:00: Evaluation "28b04ae3" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
2d9fb037  b0dccea3  sleeper     0        run      running  4s ago   3s ago
733eb34d  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8841e305  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
de029dc7  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
f33973b8  0e569f27  sleeper     0        run      running  4s ago   3s ago


➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-17T17:58:36-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-17T17:58:36-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" marked for migration
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" marked for migration
2025-10-17T17:58:38-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" draining
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" draining
2025-10-17T17:58:39-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" status running -> failed
2025-10-17T17:58:39-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" status running -> failed
2025-10-17T17:58:39-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        2       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
10065b8b  0e569f27  sleeper     0        run      running  5s ago   4s ago
9d99b920  0e569f27  sleeper     0        run      running  5s ago   4s ago
2d9fb037  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
733eb34d  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
8841e305  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
de029dc7  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
f33973b8  0e569f27  sleeper     0        run      running  25s ago  24s ago

The drain stops the two allocations on the node in a failed state and immediately places two new allocations. For drains, the allocations should instead be stopped with a complete status and not be replaced.

Behavior with this changeset

alloc stop command
➜ nomad run sleep.hcl

==> 2025-10-20T08:10:34-07:00: Monitoring evaluation "d89ce708"
    2025-10-20T08:10:34-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:35-07:00: Allocation "05ad7436" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "7a1b5420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "995f5e33" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "a5fd7420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "c5c12c43" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:35-07:00: Evaluation "d89ce708" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
05ad7436  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
7a1b5420  0e569f27  sleeper     0        run      running  3s ago   2s ago
995f5e33  b0dccea3  sleeper     0        run      running  3s ago   2s ago
a5fd7420  0e569f27  sleeper     0        run      running  3s ago   2s ago
c5c12c43  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 05
==> 2025-10-20T08:10:43-07:00: Monitoring evaluation "abb43bda"
    2025-10-20T08:10:43-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:44-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:44-07:00: Evaluation "abb43bda" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         4        0       1         0     0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
sleeper     63d25748  3m47s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
05ad7436  6c4fcb70  sleeper     0        stop     complete  14s ago  4s ago
7a1b5420  0e569f27  sleeper     0        run      running   14s ago  13s ago
995f5e33  b0dccea3  sleeper     0        run      running   14s ago  13s ago
a5fd7420  0e569f27  sleeper     0        run      running   14s ago  13s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   14s ago  13s ago

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       1         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
0befef56  b0dccea3  sleeper     0        run      running   3m56s ago  3m55s ago
05ad7436  6c4fcb70  sleeper     0        stop     complete  7m57s ago  7m47s ago
7a1b5420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
995f5e33  b0dccea3  sleeper     0        run      running   7m57s ago  7m56s ago
a5fd7420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   7m57s ago  7m56s ago

Now the allocation is stopped in a complete state, and a new allocation has not immediately replaced it. Instead, the allocation is rescheduled according to its reschedule policy, matching the documented behavior. Once the delayed evaluation executes, the replacement allocation is placed.
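Mechanically, instead of an immediate placement the reconciler emits a delayed follow-up evaluation whose WaitUntil is pushed out by the policy delay. A rough sketch of what such an eval looks like (field names are from nomad/structs; the exact construction inside the reconciler is simplified and partly assumed here):

package main // illustrative sketch only

import (
	"time"

	"github.com/hashicorp/nomad/helper/uuid"
	"github.com/hashicorp/nomad/nomad/structs"
)

// followupEval builds a delayed rescheduling eval for a stopped batch alloc.
func followupEval(job *structs.Job, now time.Time, delay time.Duration) *structs.Evaluation {
	return &structs.Evaluation{
		ID:                uuid.Generate(),
		Namespace:         job.Namespace,
		Priority:          job.Priority,
		Type:              job.Type, // "batch"
		TriggeredBy:       structs.EvalTriggerAllocReschedule, // new trigger reason in this PR
		JobID:             job.ID,
		Status:            structs.EvalStatusPending,
		StatusDescription: "created for delayed rescheduling",
		WaitUntil:         now.Add(delay), // e.g. the 4m constant delay in the jobspec above
	}
}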

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one second deadline:

➜ nomad run sleep.hcl

==> 2025-10-20T08:21:36-07:00: Monitoring evaluation "ad5b6d81"
    2025-10-20T08:21:36-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:21:37-07:00: Allocation "f7af18cc" created: node "0e569f27", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "7386d7b1" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8392ca41" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8765c6ba" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "d647f127" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:21:37-07:00: Evaluation "ad5b6d81" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
7386d7b1  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8392ca41  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
8765c6ba  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
d647f127  b0dccea3  sleeper     0        run      running  4s ago   3s ago
f7af18cc  0e569f27  sleeper     0        run      running  4s ago   4s ago

➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-20T08:22:11-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-20T08:22:11-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-20T08:22:13-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" marked for migration
2025-10-20T08:22:13-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" marked for migration
2025-10-20T08:22:13-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" draining
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" draining
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" status running -> complete
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" status running -> complete
2025-10-20T08:22:14-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         3        0       2         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
7386d7b1  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
8392ca41  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
8765c6ba  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
d647f127  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
f7af18cc  0e569f27  sleeper     0        run      running   41s ago  41s ago

The drain stops the two allocations on the node in a complete state, and the allocations are not replaced. This matches the documented behavior.

New evaluation trigger reason

Nomad's current behavior when rescheduling an allocation is to assume the allocation being replaced has failed. When stopping an allocation, this results in an eval status like the following:

➜ nomad eval status 8dd
ID                 = 8dde8bd1
Create Time        = 24s ago
Modify Time        = 24s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-failure
Job ID             = sleep-job
Namespace          = default
...

The TriggeredBy value implies that the eval was triggered by an allocation failure, when it was actually triggered by the allocation being rescheduled due to the alloc stop command. To describe the reason more accurately, the EvalTriggerAllocReschedule constant was introduced and is used in this situation, giving the value alloc-reschedule as shown below:

➜ nomad eval status 440
ID                 = 44058981
Create Time        = 10s ago
Modify Time        = 10s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-reschedule
Job ID             = sleep-job
Namespace          = default
...
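For completeness, trigger reasons are plain string constants in nomad/structs; the new one sits alongside the existing reasons, roughly like this (the neighboring constant names exist in the current source, but their exact placement here is an assumption):

const (
	EvalTriggerAllocStop        = "alloc-stop"
	EvalTriggerRetryFailedAlloc = "alloc-failure"    // the reason shown in the first eval above
	EvalTriggerAllocReschedule  = "alloc-reschedule" // added in this changeset
)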

Links

Fixes #26929

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

Review comments

  remaining = make(allocSet)
  for id, alloc := range set {
-   if !alloc.ServerTerminalStatus() {
+   if (alloc.Job.Type == structs.JobTypeBatch && !alloc.DesiredTransition.ShouldReschedule()) || !alloc.ServerTerminalStatus() {
Member:

We're keeping batch allocs if they're server-terminal and don't have desired-transition reschedule. Is this because of nomad alloc stop? I don't think those allocs are actually server-terminal until after they've already been through the scheduler once.

In any case, this weird conditional could definitely use a "why" comment.

Member Author:

Yes, that is correct that this is because of alloc stop. Without this addition to the conditional, when the future eval is run, no allocation will be placed because any existing complete allocations will be counted for the total. Filtering out those that are marked for being rescheduled allows them to actually be placed when the eval is run.
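Expanded for readability, the conditional above keeps an alloc in the remaining set (and therefore counted toward the group total) roughly like this (an illustrative fragment of the loop shown above, same logic as the one-liner):

// Keep server-terminal batch allocs so completed batch work is not
// replaced, but drop the ones explicitly marked for reschedule via
// alloc stop so the delayed eval can place their replacements.
keep := (alloc.Job.Type == structs.JobTypeBatch &&
	!alloc.DesiredTransition.ShouldReschedule()) ||
	!alloc.ServerTerminalStatus()
if keep {
	remaining[id] = alloc
}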


  if (a.DesiredStatus == AllocDesiredStatusStop && !a.LastRescheduleFailed()) ||
-   (a.ClientStatus != AllocClientStatusFailed && a.ClientStatus != AllocClientStatusLost) ||
+   (!isBatch && a.ClientStatus != AllocClientStatusFailed && a.ClientStatus != AllocClientStatusLost) ||
Member:

If I have a batch alloc that's complete, but not yet stopped on the server, this change will mean NextRescheduleTime potentially returns true for the eval where we process that update.

Member Author:

Adjusted this to check for rescheduled batch.

  as = as.filterByTerminal()
  desiredChanges := new(structs.DesiredUpdates)
  desiredChanges.Stop, allocsToStop = as.filterAndStopAll(a.clusterState)
+ // TODO(spox): what is with allocsToStop here? not appended, only last set returned?
Member:

Yikes, that seems wrong

Member Author:

Yeah, this is just a note for me to investigate a bit and spin out a separate PR.


Development

Successfully merging this pull request may close these issues.

scheduler: incorrect scheduling of batch job allocations on drain