Batch jobs using vault fail with error "unable to find token for workload" #25955

Open
@Zarickan

Nomad version

Nomad v1.10.1
BuildDate 2025-05-13T07:40:43Z
Revision 3431f13e8036b4716aac0e3b8c5854ddca212e5c

Operating system and Environment details

Debian GNU/Linux 13 (trixie) x86_64, 6.9.9-amd64
Nomad installed through apt: nomad/bookworm,now 1.10.1-1 amd64

Issue

Unable to run any periodic/batch jobs that use Vault; they all fail with the following error:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s

From what I can tell, the error comes from here. Perhaps it is related to the mentioned race condition?

// This is an error as every identity should have a token by the time Get
// is called.
return nil, fmt.Errorf("unable to find token for workload %q and identity %q", id.WorkloadIdentifier, id.IdentityName)
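
As a rough illustration of how this message could arise, here is a minimal sketch. It is not Nomad's actual code: the `tokenStore` and `identityKey` types and the `get`/`set` methods are hypothetical names invented for this example. It only assumes that signed identities are cached per (workload, identity) pair, so a lookup that runs before the token has been stored fails with the same error text:

```go
package main

import (
	"fmt"
	"sync"
)

// identityKey is a hypothetical cache key: one entry per (workload, identity) pair.
type identityKey struct {
	WorkloadIdentifier string
	IdentityName       string
}

// tokenStore is a hypothetical stand-in for wherever signed identities are cached.
type tokenStore struct {
	mu     sync.RWMutex
	tokens map[identityKey]string
}

func newTokenStore() *tokenStore {
	return &tokenStore{tokens: make(map[identityKey]string)}
}

// set records a signed token for the given workload identity.
func (s *tokenStore) set(id identityKey, token string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[id] = token
}

// get returns the cached token, or an error if none has been stored yet.
func (s *tokenStore) get(id identityKey) (string, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	token, ok := s.tokens[id]
	if !ok {
		return "", fmt.Errorf("unable to find token for workload %q and identity %q",
			id.WorkloadIdentifier, id.IdentityName)
	}
	return token, nil
}

func main() {
	store := newTokenStore()
	id := identityKey{WorkloadIdentifier: "periodic-repro", IdentityName: "vault_default"}

	// Lookup before any token was stored: this is the failing case.
	if _, err := store.get(id); err != nil {
		fmt.Println("error:", err)
	}

	// Once a token has been stored, the same lookup succeeds.
	store.set(id, "signed-jwt-placeholder")
	token, err := store.get(id)
	fmt.Println(token, err)
}
```

If something like this is what happens, the retries in the logs would keep failing for as long as nothing ever stores a token for `vault_default`, which would match the job never leaving the pending state.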

This is happening with all the periodic jobs I have, and it seems to happen regardless of what the job specification looks like, as long as it contains a vault section. The issue appeared without me consciously changing anything in Nomad, Vault, or the job specifications; the jobs just started failing on their own.

Reproduction steps

  1. Submit the job to Nomad
  2. Force launch the periodic job
  3. The job gets stuck in either the pending or recovering state (depending on how long it is left alone), with the error above in the Nomad logs

Expected Result

Job starts and runs without issue.

Actual Result

The started job (and any future instances of the periodic job) is stuck in the pending state forever, with the mentioned error in the Nomad logs.

Job file (if appropriate)

Reproducible with this minimal job:

job "periodic-repro" {
  region      = "dk"
  datacenters = ["dk1"]
  type        = "batch"
  namespace   = "monitor"

  periodic {
    cron             = "0 0 * * *"
  }

  group "periodic-repro" {
    count = 1

    task "periodic-repro" {
      driver = "docker"

      config {
        image        = "busybox:latest"
        network_mode = "host"
        command      = "sh"
        args         = ["-c", "echo 'Hello World' && sleep 120"]
      }

      resources {
        cpu    = 100
        memory = 256
      }

      vault {
        policies    = ["monitor-nrgi"]
        change_mode = "restart"
      }
    }
  }
}

Nomad Server logs (if appropriate)

These are the logs from my Nomad server, which is also the client the job runs on:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type=Received msg="Task received by client" failed=false
[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro type="Task Setup" msg="Building Task Directory" failed=false
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=5s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=10s
[ERROR] client.alloc_runner.task_runner.task_hook.vault: failed to derive Vault token: alloc_id=91a315b9-dfc6-0353-fe7f-d0e2e540a323 task=periodic-repro error="failed to retrieve signed workload identity: unable to find token for workload \"periodic-repro\" and identity \"vault_default\"" recoverable=true backoff=20s
