Description
We have the following scenario:
We run gitlab-runner in Kubernetes (EKS, K8s version 1.31), using the Kubernetes executor, with all job pods running on a dedicated, autoscaling node group. This node group is managed by Cluster Autoscaler (CA, version 1.32).
All GitLab job pods have the following properties (a sketch of such a pod is shown after the list):
- not backed by a controller object
- local storage (`emptyDir`)
- the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` is set to `false`
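For context, stripped down to the relevant fields, a job pod looks roughly like this (pod name and image are placeholders, and the exact metadata may differ slightly depending on how the runner injects the annotation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runner-abc123-project-1-concurrent-0   # placeholder name
  annotations:
    # annotation value is the string "false"
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
  # no ownerReferences: the pod is not backed by a controller object
spec:
  containers:
    - name: build
      image: registry.example.com/ci-image:latest   # placeholder image
      volumeMounts:
        - name: repo
          mountPath: /builds
  volumes:
    - name: repo
      emptyDir: {}   # local storage
```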
If I am reading the CA FAQ correctly, any one of these properties should stop a pod from being evicted. I would therefore expect a node that holds such a pod not to be considered when CA determines which nodes are unneeded. Yet we keep seeing pods get evicted, which is particularly frustrating because many of them run Terraform configurations, which then end up state-locked because the job was disrupted.
We have experimented with tweaking the parameters; the current configuration in the Helm values file is:
extraArgs:
scale-down-utilization-threshold: 0.01
scale-down-unneeded-time: 15m
cordon-node-before-terminating: true
ignore-daemonsets-utilization: true
We tried extending `node-delete-delay-after-taint`, but that just leads to nodes sitting around hard-tainted and unusable, which blocks the node group from scaling up and leaves new jobs unable to schedule.
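For illustration, this is the shape of the change we tried; the `10m` value here is only an example, not the exact setting we used:

```yaml
extraArgs:
  scale-down-utilization-threshold: 0.01
  scale-down-unneeded-time: 15m
  cordon-node-before-terminating: true
  ignore-daemonsets-utilization: true
  node-delete-delay-after-taint: 10m   # example value only
```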
Our best guess at the moment is that CA does not take pending pods into account when marking a node as unneeded; those pods then land on the node and get evicted when it is finally scaled down. I know the soft taint does not prevent new pods from being scheduled on a node, but I still don't understand why none of the above properties, which according to the documentation should stop a pod from being evicted, seem to have any effect.
There are no errors in the cluster-autoscaler pod logs.
Is there anything else we can consider? This is starting to cause frustration among devs, as they have to keep rerunning CI/CD jobs.