System jobs on node pools scaled to 0 show up in UI as failed #26155

Open
@jriddy

Nomad version

1.9.10+ent

Operating system and Environment details

Fedora 42, AWS

Issue

We have a system job that runs on nodes dedicated to a certain job, and we scale those nodes based on the state of an external work queue. When that queue is empty, we would like to scale to zero. Scaling works, but the system job displays as "failed" when we scale the node pools to 0 and the job has zero allocations (which is a desirable state).

This issue was even pointed out in #24620, which introduced this bug after #23829 fixed it:

While there are problems with showing an alloc-less system job as "Failed" (it may in fact have been deliberately "scaled down" by virtue of taking all eligible nodes offline, for example), this seems like it's by far the most common reason you'd have an un-GC'd system job with no allocs, and so "Scaled Down" shouldn't be the default label for it. This PR makes it so system/sysbatch jobs cannot be labelled "Scaled Down" as such.

Reading further, it's clear there are a lot of edge cases here, so I understand that the core issue of interpreting the meaning of 0 allocations is complicated. I'd wager the real problem is that the Nomad API conveys little failure information in the job data model. Any notion of "failure" seems to be something the UI or another API user has to infer from the data the API provides, which makes sense given the number of things that can happen to allocations in different scenarios. But it does yield a confusing UI experience.

The only solution I can think of would be to check recent allocations for success/failure and make a call based on that.
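To make that idea concrete, here is a rough sketch of the heuristic, not how the Nomad UI actually computes job status. The function name `classify_system_job`, the `window` size, and the returned labels are all hypothetical; the only thing borrowed from Nomad is the set of allocation client status strings (`pending`, `running`, `complete`, `failed`, `lost`). It assumes the caller passes the job's allocation statuses ordered newest first.

```python
from typing import Iterable, Sequence

# Allocation client statuses that mean something is (or is about to be) running.
LIVE_STATUSES = {"pending", "running"}


def classify_system_job(
    recent_statuses: Sequence[str],
    failure_statuses: Iterable[str] = ("failed",),
    window: int = 5,
) -> str:
    """Guess a display label for a system job (hypothetical helper).

    recent_statuses: allocation client statuses, newest first (assumed ordering).
    failure_statuses: which statuses count as a failure; whether "lost"
        belongs here is a judgment call, so it is left to the caller.
    window: how many of the most recent allocations to inspect.
    """
    statuses = list(recent_statuses)

    # Any live allocation means the job is not alloc-less at all.
    if any(s in LIVE_STATUSES for s in statuses):
        return "Running"

    # No live allocations: only call it a failure if a recent one failed.
    failures = set(failure_statuses)
    if any(s in failures for s in statuses[:window]):
        return "Failed"

    # No live allocations and no recent failures (or no history at all).
    return "Scaled Down"
```

With this sketch, a job whose node pool has never had clients (no allocations at all) or whose last allocations all ended as `complete` would be labelled "Scaled Down", while one whose most recent allocation ended as `failed` would still be labelled "Failed".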

Reproduction steps

  1. Create a node pool with no clients attached
  2. Create a system job assigned to that node pool
  3. Observe the UI for that job

Expected Result

Job shows up in UI as "Scaled Down" or some other non-failure state.

Actual Result

(Two screenshots attached in the original issue.)
