
Client metadata updates cause evals on completed jobs, then they never GC #26181

Open
@matkinson-godaddy

Nomad version

Nomad v1.10.2
BuildDate 2025-06-09T22:00:49Z
Revision df4c764+CHANGES

We also experienced this in 1.9.6 and had hoped that upgrading to 1.10.2 would fix it.

Operating system and Environment details

AlmaLinux 9.6 and macOS 15.5

Issue

For the last few months, none of our batch jobs have been GC'ed, which causes our Nomad server memory usage to balloon. As a workaround, we have been manually running nomad system gc, which clears them out.

After much investigation, we noticed on the evaluations page in the UI that these batch jobs received a new evaluation every hour. We monitored the client logs at TRACE level and saw these lines each time such an evaluation was created:

2025-07-01T21:15:44.097Z [DEBUG] http: request complete: method=POST path="/v1/client/metadata?namespace=system&region=p3" duration=2.080858ms
2025-07-01T21:15:49.547Z [DEBUG] client: state changed, updating node and re-registering
2025-07-01T21:15:49.576Z [DEBUG] client: evaluations triggered by node registration: num_evals=82

This led me to believe the metadata was involved. On my local dev cluster, I ran a short-lived batch job with the GC threshold set to 1m, then added a new dynamic metadata field to the client and changed it over and over again. Every change created a new evaluation, and the job was never GC'ed.

After further research, the cause is that we now update dynamic metadata hourly, and that hourly interval is shorter than the default 4h GC setting.
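The timing argument can be sketched with a small simulation. This is a simplified model, not Nomad's actual GC code: it assumes a completed job becomes collectable only once its newest evaluation is at least as old as the GC threshold, and that every metadata update creates a fresh evaluation. The function name simulate_gc and its parameters are made up for illustration.

```shell
# simulate_gc UPDATE_EVERY_MIN GC_THRESHOLD_MIN HORIZON_MIN
# Simplified model (an assumption, not Nomad internals): the job is
# collectable once its newest evaluation is at least GC_THRESHOLD_MIN old.
# UPDATE_EVERY_MIN=0 means the client metadata is never updated.
simulate_gc() {
  sim_update_every=$1
  sim_threshold=$2
  sim_horizon=$3
  sim_last_eval=0   # the job completes at minute 0 with its final evaluation
  sim_now=1
  while [ "$sim_now" -le "$sim_horizon" ]; do
    # A metadata update re-registers the node and creates a new evaluation.
    if [ "$sim_update_every" -gt 0 ] && [ $((sim_now % sim_update_every)) -eq 0 ]; then
      sim_last_eval=$sim_now
    fi
    # Check whether the newest evaluation is old enough to be collected.
    if [ $((sim_now - sim_last_eval)) -ge "$sim_threshold" ]; then
      echo "collectable at minute $sim_now"
      return 0
    fi
    sim_now=$((sim_now + 1))
  done
  echo "never collectable within $sim_horizon minutes"
}
```

Under this model, `simulate_gc 60 240 1440` (hourly updates, 4h threshold, one day) reports that the job is never collectable, while `simulate_gc 0 240 1440` (no updates) reports it collectable at minute 240.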

This seems like a pretty bad bug that could potentially lead to a lot of dead/completed jobs never being GC'ed and a lot of unnecessary evaluations created in large clusters.

Reproduction steps

Create a config file with the following settings:

    "server": {
        "job_gc_interval": "1m",
        "job_gc_threshold": "1m"
    },

Then run a short-lived batch job:

job "batch_test_1" {
  type        = "batch"

  group "group_test" {
    task "sleep_test" {
      driver = "docker"

      config {
        image          = "busybox"
        auth_soft_fail = true
        command        = "sleep"
        args           = ["5"]
      }
      
      resources {
        cpu = 10
        memory = 10
      }

    }
  }
}

Once the job completes, add a dynamic metadata field on the client and keep changing its value.
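One way to script this step is a small loop around nomad node meta apply, run on the client node so that it targets the local agent. This is only a sketch: the function name bump_meta, the NOMAD_BIN variable, and the key name gc_repro are made up for illustration, and depending on your setup you may need extra flags such as a node ID, region, or namespace.

```shell
# NOMAD_BIN is a hypothetical override so the loop can be dry-run,
# for example with NOMAD_BIN=echo.
NOMAD_BIN="${NOMAD_BIN:-nomad}"

# bump_meta KEY COUNT PAUSE_SECONDS
# Sets KEY to a new value COUNT times, pausing between updates, so that
# each call changes the client's dynamic metadata.
bump_meta() {
  meta_key=$1
  meta_count=$2
  meta_pause=$3
  meta_i=1
  while [ "$meta_i" -le "$meta_count" ]; do
    # A changed value should make the client re-register the node.
    "$NOMAD_BIN" node meta apply "$meta_key=run-$meta_i" || return 1
    sleep "$meta_pause"
    meta_i=$((meta_i + 1))
  done
}
```

For example, `bump_meta gc_repro 5 30` changes the value five times, 30 seconds apart, which is more frequent than the 1m GC threshold configured above.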

Expected Result

New evaluations should not be created and the job should be GC'ed after 1m.

Actual Result

A new evaluation is created for every metadata update, and the job is not GC'ed until you stop updating the dynamic metadata on the client.
