Open
Labels
hcc/jira, theme/batch, theme/scheduling, type/bug
Description
Nomad's scheduling of batch job allocations is currently inconsistent with the documented behavior. According to the documentation, batch job allocations should behave as follows:
- when stopped with nomad alloc stop - the allocation should be rescheduled
- when any task status becomes failed - the allocation should be rescheduled
- when drained - the allocation should not be replaced (the allocation is allowed to complete, or is killed if the deadline is reached)
Currently the drain behavior is not working as documented.
The current behavior is demonstrated below on a cluster with 3 agents, using the following simple jobspec that defines a batch job:
batch jobspec
job "sleep-job" {
  type = "batch"

  group "sleeper" {
    count = 5

    ephemeral_disk {
      size = 10
    }

    task "do_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["1d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }

    task "extra_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["1d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }
  }
}
drain behavior
Running the job we get an initial status:
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
25f432b7  490b97bb  sleeper     0        run      running  3s ago   2s ago
a8eb00d4  717d40fd  sleeper     0        run      running  3s ago   2s ago
d05c5866  52a010ff  sleeper     0        run      running  3s ago   2s ago
dffa4043  490b97bb  sleeper     0        run      running  3s ago   2s ago
ec349a28  52a010ff  sleeper     0        run      running  3s ago   2s ago
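Output of this shape comes from the job status command. As a minimal sketch for reproducing it, the helper below submits the jobspec and then prints the status. The helper name submit_and_show and the file name sleep-job.nomad.hcl are assumptions for illustration. Only the job name sleep-job comes from the jobspec above.

```shell
# Hypothetical helper: submit the batch jobspec, then show the job status.
# The default file name is an assumption; pass another path as the first argument.
submit_and_show() {
  jobspec="${1:-sleep-job.nomad.hcl}"
  # "nomad job run" waits for the evaluation to finish before returning,
  # so the status that follows should already list the placed allocations.
  nomad job run "$jobspec" && nomad job status sleep-job
}

# Example (not executed here):
#   submit_and_show sleep-job.nomad.hcl
```

Note that the raw_exec driver is disabled by default, so the client agents need it enabled for this job to be placed.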
Now, draining node 490b97bb with a deadline of 2s results in:
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        2       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created   Modified
7ab9320e  52a010ff  sleeper     0        run      running  3s ago    2s ago
ccb7284f  717d40fd  sleeper     0        run      running  3s ago    2s ago
25f432b7  490b97bb  sleeper     0        stop     failed   2m6s ago  3s ago
a8eb00d4  717d40fd  sleeper     0        run      running  2m6s ago  2m5s ago
d05c5866  52a010ff  sleeper     0        run      running  2m6s ago  2m5s ago
dffa4043  490b97bb  sleeper     0        stop     failed   2m6s ago  2s ago
ec349a28  52a010ff  sleeper     0        run      running  2m6s ago  2m5s ago
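The exact drain command is not included in the report. A plausible sketch of it, using the node ID prefix and the 2s deadline mentioned above, is shown below. The helper name drain_node is hypothetical, and -yes is added only to skip the interactive confirmation.

```shell
# Hypothetical helper: enable drain on a node with a deadline.
# Arguments: node ID (or unique prefix), optional deadline (defaults to 2s).
drain_node() {
  node_id="$1"
  deadline="${2:-2s}"
  nomad node drain -enable -deadline "$deadline" -yes "$node_id"
}

# Example (not executed here):
#   drain_node 490b97bb 2s
```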
The two allocations that were running on node 490b97bb have a status of failed and were rescheduled. The expected behavior is for these two allocations to have a status of complete and not be rescheduled.
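To make the mismatch easy to check after a drain, the sketch below counts the allocations on a given node whose status is failed. It reads job status output on stdin and assumes the column layout shown above (ID, Node ID, Task Group, Version, Desired, Status) and a task group name without spaces. The function name failed_on_node is hypothetical.

```shell
# Hypothetical helper: count allocations on a node whose Status column is "failed".
# Reads "nomad job status" output on stdin. With whitespace-separated fields,
# $2 is the node ID and $6 is the status for allocation rows.
failed_on_node() {
  awk -v node="$1" '$2 == node && $6 == "failed" { n++ } END { print n + 0 }'
}

# Example (not executed here):
#   nomad job status sleep-job | failed_on_node 490b97bb
```

Against the post-drain status above this prints 2 for node 490b97bb. Under the documented behavior it would print 0.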