Description
Nomad version
1.9.10+ent
Operating system and Environment details
Fedora 42, AWS
Issue
We have a system job that runs on nodes in a certain node pool, which we scale based on the state of an external work queue. When that queue is empty we would like to scale to zero. Scaling works, but the system job displays as "Failed" when we scale the node pool to zero and the job has zero allocations (which is a desirable state).
This issue was even pointed out in #24620, which introduced this bug after #23829 had fixed it:
> While there are problems with showing an alloc-less system job as "Failed" (it may in fact have been deliberately "scaled down" by virtue of taking all eligible nodes offline, for example), this seems like it's by far the most common reason you'd have an un-GC'd system job with no allocs, and so "Scaled Down" shouldn't be the default label for it. This PR makes it so system/sysbatch jobs cannot be labelled "Scaled Down" as such.
Reading further, it's clear there are a lot of edge cases here, so I understand that interpreting the meaning of zero allocations is complicated. I'd wager the real problem is a lack of failure information conveyed by the Nomad API in the job data model. Any notion of "failure" seems to be something the UI or other API consumers have to infer from data the API provides, which makes sense given the number of things that can happen to allocations in different scenarios, but it does yield a confusing UI experience.
The only solution I can think of would be to check recent allocations for success/failure and make a call based on that.
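To make that concrete, here is a minimal sketch of what I mean, using the Go API client (github.com/hashicorp/nomad/api). The `inferDisplayStatus` helper, the status labels, and the one-hour "recent" window are my own illustrative choices, not anything Nomad ships today:

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

// inferDisplayStatus is a hypothetical helper sketching the heuristic above:
// only label a system job "Failed" when its recent allocations actually
// failed. Zero allocations (e.g. the node pool has no clients) is treated as
// "Scaled Down" rather than a failure.
func inferDisplayStatus(client *api.Client, jobID string, window time.Duration) (string, error) {
	// List all allocations for the job, including completed ones.
	allocs, _, err := client.Jobs().Allocations(jobID, true, nil)
	if err != nil {
		return "", err
	}

	// No allocations at all: nothing has failed, so don't report failure.
	if len(allocs) == 0 {
		return "Scaled Down", nil
	}

	cutoff := time.Now().Add(-window).UnixNano()
	var running, failed int
	for _, a := range allocs {
		if a.ModifyTime < cutoff {
			continue // ignore allocations outside the "recent" window
		}
		switch a.ClientStatus { // "pending", "running", "complete", "failed", "lost"
		case "running":
			running++
		case "failed":
			failed++
		}
	}

	switch {
	case running > 0:
		return "Running", nil
	case failed > 0:
		return "Failed", nil
	default:
		// Only old or completed allocations remain, e.g. after scaling the
		// node pool to zero.
		return "Scaled Down", nil
	}
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "my-system-job" and the one-hour window are placeholders.
	status, err := inferDisplayStatus(client, "my-system-job", time.Hour)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("display status:", status)
}
```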
Reproduction steps
- Create a node pool with no clients attached
- Create a system job assigned to that node pool
- Observe the UI for that job
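For anyone who wants to script the reproduction, here is a rough equivalent of the steps above using the Go API client and the node pools API added in Nomad 1.6. The pool name `empty-pool`, the job ID `repro-system-job`, and the raw_exec task are placeholders; the task body never runs anyway, since the pool has no clients:

```go
package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// 1. Create a node pool that no client is configured to join.
	pool := &api.NodePool{
		Name:        "empty-pool",
		Description: "no clients; used to reproduce the system job status bug",
	}
	if _, err := client.NodePools().Register(pool, nil); err != nil {
		log.Fatal(err)
	}

	// 2. Register a system job pinned to that pool. The task body is
	//    irrelevant: with no clients in the pool, nothing is ever placed.
	poolName := "empty-pool"
	job := api.NewSystemJob("repro-system-job", "repro-system-job", "global", 50)
	job.Datacenters = []string{"*"}
	job.NodePool = &poolName // requires Nomad 1.6+ node pools

	group := api.NewTaskGroup("group", 1)
	task := api.NewTask("noop", "raw_exec")
	task.SetConfig("command", "/bin/sleep")
	task.SetConfig("args", []string{"3600"})
	group.AddTask(task)
	job.AddTaskGroup(group)

	if _, _, err := client.Jobs().Register(job, nil); err != nil {
		log.Fatal(err)
	}

	// 3. Open the web UI for "repro-system-job" and check its reported status.
	log.Println("registered repro-system-job in empty-pool; check the UI")
}
```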
Expected Result
Job shows up in the UI as "Scaled Down" or some other non-failure state.
Actual Result
Job shows up in the UI as "Failed", even though having zero allocations is intentional.
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)