Description
Nomad version
Nomad v1.10.1
BuildDate 2025-05-13T07:40:43Z
Revision 3431f13e8036b4716aac0e3b8c5854ddca212e5c
Operating system and Environment details
Debian GNU/Linux 13 (trixie) x86_64, 6.9.9-amd64
Nomad installed through apt: nomad/bookworm,now 1.10.1-1 amd64
Issue
Unable to run any periodic/batch jobs that use Vault. Every launch fails with the following error:
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO] client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s
From what I can tell, the error comes from here. Perhaps it is related to the mentioned race condition?
Lines 147 to 149 in 348177d
This happens with all of my periodic jobs, and it seems to happen regardless of what the job specification looks like, as long as it contains a vault section. The issue appeared without me consciously changing anything in Nomad, Vault, or the job specifications; the jobs just started failing on their own.
Reproduction steps
- Submit the job to Nomad
- Force launch the periodic job
- The job gets stuck in either the pending or recovering state (depending on how long it is left alone), with the error above in the Nomad logs
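The steps above can be sketched as a small shell function. This is only an illustration: the `NOMAD_BIN` variable, the `repro` function name, and the job file name `periodic-repro.nomad.hcl` are assumptions and not part of my actual setup; the namespace and job name are taken from the job file in this report.

```shell
#!/usr/bin/env sh
# Sketch of the reproduction steps. NOMAD_BIN, repro, and the job file
# name passed in are hypothetical names used for illustration only.
NOMAD_BIN="${NOMAD_BIN:-nomad}"

repro() {
  job_file="$1"  # e.g. periodic-repro.nomad.hcl (assumed file name)

  # 1. Submit the periodic job (the namespace comes from the job file).
  "$NOMAD_BIN" job run "$job_file" || return 1

  # 2. Force a launch instead of waiting for the cron schedule.
  "$NOMAD_BIN" job periodic force -namespace=monitor periodic-repro || return 1

  # 3. Inspect the job; the launched child job is expected to stay pending.
  "$NOMAD_BIN" job status -namespace=monitor periodic-repro
}
```

Calling `repro periodic-repro.nomad.hcl` on the affected node should be enough to trigger the Vault hook error in the client logs, assuming the CLI is already pointed at the right cluster and region.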
Expected Result
Job starts and runs without issue.
Actual Result
The started job (and any future instance of the periodic job) is stuck in the pending state forever, with the mentioned error in the Nomad logs.
Job file (if appropriate)
Reproducible with this minimal job:
job "periodic-repro" {
  region      = "dk"
  datacenters = ["dk1"]
  type        = "batch"
  namespace   = "monitor"

  periodic {
    cron = "0 0 * * *"
  }

  group "periodic-repro" {
    count = 1

    task "periodic-repro" {
      driver = "docker"

      config {
        image        = "busybox:latest"
        network_mode = "host"
        command      = "sh"
        args         = ["-c", "echo 'Hello World' && sleep 120"]
      }

      resources {
        cpu    = 100
        memory = 256
      }

      vault {
        policies    = ["monitor-nrgi"]
        change_mode = "restart"
      }
    }
  }
}
Nomad Server logs (if appropriate)
My Nomad server is also the client the job runs on, so its logs are the same as the ones shown under Issue above.