Pods keep getting evicted when they shouldn't #8196

Open
@alex-hempel

We have the following scenario:

We run gitlab-runner in Kubernetes (EKS, K8s version 1.3.1) using the Kubernetes executor, with all job pods running on a dedicated, autoscaling node group. This node group is managed by Cluster Autoscaler (CA, version 1.32).

All GitLab job pods have the following properties:

  • not backed by a controller object
  • local storage (EmptyDir)
  • Annotation cluster-autoscaler.kubernetes.io/safe-to-evict is set to false

If I am reading the FAQ correctly, then any of these properties should stop a pod from getting evicted. I would assume that a node which holds any such pod will not be taken into account when CA determines which nodes are unneeded. Yet we keep seeing pods getting evicted, which is particularly frustrating because a lot of them run Terraform configurations, which then get state-locked due to the job being disrupted.
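
For illustration, a pod with all three properties would look roughly like the sketch below. The pod name, image, command, and volume name are placeholders chosen for this example, not what gitlab-runner actually generates. Note that annotation values must be strings, so the value is written as "false" in quotes.

```yaml
# Hypothetical job pod illustrating the three properties listed above.
# Name, image, command, and volume name are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: runner-job-example            # placeholder name
  # No ownerReferences: the pod is not backed by a controller object.
  annotations:
    # Annotation values must be strings, hence the quotes.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  restartPolicy: Never
  containers:
    - name: build
      image: alpine:3.20               # placeholder image
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}                     # local storage (EmptyDir)
```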

We have experimented with tweaking the parameters. The current configuration in the Helm values file is:

extraArgs:
  scale-down-utilization-threshold: 0.01
  scale-down-unneeded-time: 15m
  cordon-node-before-terminating: true
  ignore-daemonsets-utilization: true

We tried extending node-delete-delay-after-taint, but that just leads to nodes sitting around hard-tainted and unusable, which blocks the node group from being scaled up and therefore prevents new jobs from being scheduled.

Our best guess at the moment is that CA does not take pending pods into account when marking a node as unneeded, so these pods stay on the node and get evicted when the node is finally scaled down. I know that the soft taint does not prevent new pods from being scheduled on a node, but I still don't understand why none of the above properties, each of which should stop a pod from being evicted according to the documentation, seems to do so.

There are no errors in the cluster-autoscaler pod logs.

Is there anything else we can consider? This is starting to cause frustration among devs, as they have to keep rerunning CI/CD jobs.
