
[feature request] PROPOSAL for CloneSet "Downscale Pod Picker" API #902

Closed

@thesuperzapper

Background

I recently created KEP-3189, which proposes a downscalePodPicker field for ReplicaSetSpec and DeploymentSpec. This field specifies a user-created "Pod Picker" REST API that tells the controller which Pods to remove when replicas is decreased.

As you might imagine, getting anything added to upstream K8S for workloads that aren't simple web APIs is like pulling teeth. I am hopeful that KEP-3189 succeeds upstream, but I expect OpenKruise will be more receptive to this feature and better understand its benefits.

Therefore, I ask that the OpenKruise project consider a "Downscale Pod Picker" API for the CloneSet resource.

For full details about KEP-3189, see PR kubernetes/enhancements#3190


Technical Details (from KEP-3189)

The gist of KEP-3189 is that a new downscalePodPicker field will be added to ReplicaSetSpec and DeploymentSpec. This field specifies a user-created "Pod Picker" REST API that tells the controller which Pods to remove when replicas is decreased.

K8S API Changes:

Here is an example ReplicaSet with the new downscalePodPicker field:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image

  #############################
  ## this is the new section ##
  #############################
  downscalePodPicker:
    http:
      host: my-app.my-namespace.svc.cluster.local
      httpHeaders:
        - name: authentication
          valueFrom:
            secretKeyRef:
              key: authentication-token
              name: my-secret
      path: "/downscale-pod-picker"
      port: 443
      scheme: "https"
    maxRetries: 3
    timeoutSeconds: 5
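
To make the field semantics concrete, here is a minimal Go sketch of how a controller could turn the spec above into an HTTP request. The KEP does not prescribe the request method, retry behaviour, or how Secret-backed headers are resolved; the POST method, JSON content type, retry-on-transport-error loop, and all struct/function names below are illustrative assumptions.

package picker

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// DownscalePodPicker collects the proposed spec fields needed to build a
// request (names and shape are illustrative, not the proposed API types).
type DownscalePodPicker struct {
	Scheme, Host, Path string
	Port               int
	Headers            map[string]string // header values already resolved from Secret refs
	MaxRetries         int
	TimeoutSeconds     int
}

// callPicker POSTs payload to the configured endpoint, bounding each attempt
// by TimeoutSeconds and retrying up to MaxRetries times on transport errors.
func callPicker(ctx context.Context, cfg DownscalePodPicker, payload []byte) (*http.Response, error) {
	url := fmt.Sprintf("%s://%s:%d%s", cfg.Scheme, cfg.Host, cfg.Port, cfg.Path)
	client := &http.Client{Timeout: time.Duration(cfg.TimeoutSeconds) * time.Second}

	var lastErr error
	for attempt := 0; attempt <= cfg.MaxRetries; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")
		for name, value := range cfg.Headers {
			req.Header.Set(name, value)
		}
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

With the example spec above, this would send requests to https://my-app.my-namespace.svc.cluster.local:443/downscale-pod-picker, with a 5-second per-attempt timeout and up to 3 retries.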

"Pod Picker" API contract:

The contract for a downscalePodPicker REST API will be as follows (a sketch of the payload shapes appears after this list):

  • Request payloads to the API will contain:
    • number_of_pods_requested (int): minimum number of Pods to return
    • candidate_pods (list[string]): list of Pod-names to choose from
  • Response payloads from the API will contain:
    • chosen_pods (list[string]): list of Pod-names chosen to be removed
    • tied_pods (list[string]): list of Pod-names we can't decide between
  • Other requirements:
    • both chosen_pods and tied_pods can be non-empty
    • total number of pods returned must be AT LEAST number_of_pods_requested (more if there are ties)
    • only Pod-names contained in candidate_pods may be returned
  • NOTES:
    • The response payload is split into two lists, because when number_of_pods_requested > 1, it is possible to have some Pods which were definitely chosen, and others which we cannot decide between, but who are definitely the "next best" after the chosen Pods. (e.g. if the Pods have the following metrics [1,2,2,2] and number_of_pods_requested = 2, we might return chosen_pods = [pod-1], tied_pods = [pod-2, pod-3, pod-4])
    • This contract doesn't require that downscalePodPicker APIs make a decision about ALL candidate_pods, and allows them to be designed such that they exit early from their search if they find enough good candidates before considering all Pods. (e.g. if the API is looking for the least-active Pods, it can exit early if enough fully-idle Pods are found to meet number_of_pods_requested)
    • The controller will exclude Pods in Unassigned/PodPending/PodUnknown/Unready states from candidate_pods (however, the state of any Pod may change in the time it takes for the request to be processed).
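
For illustration, the payload shapes implied by this contract can be written as Go structs with JSON tags; the field names follow the contract above, while the struct names themselves are assumptions.

package picker

// PickRequest is the payload the controller sends to the "Pod Picker" API.
type PickRequest struct {
	// Minimum number of Pods the API must return (chosen + tied).
	NumberOfPodsRequested int `json:"number_of_pods_requested"`
	// Pod names the API may choose from; no other names may be returned.
	CandidatePods []string `json:"candidate_pods"`
}

// PickResponse is the payload the "Pod Picker" API returns.
type PickResponse struct {
	// Pod names definitively chosen for removal.
	ChosenPods []string `json:"chosen_pods"`
	// Pod names the API could not decide between; the "next best" after ChosenPods.
	TiedPods []string `json:"tied_pods"`
}

For the [1,2,2,2] example above, a conforming response body would be {"chosen_pods": ["pod-1"], "tied_pods": ["pod-2", "pod-3", "pod-4"]}.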

Controller Behaviour Changes:

On downscale, the ReplicaSet controller assigns every Pod a "rank" based on the response from the "Pod Picker" API, with lower-ranked Pods being killed first (until enough Pods have been removed); a sketch of this ranking step follows the note below.

  • Returned chosen_pods have rank = 0
  • Returned tied_pods have rank = 1
  • All remaining Pods have rank = 3

NOTE: Unassigned/PodPending/PodUnknown/Unready Pods will always be killed first; if there are enough of them to fulfill the downscale, no calls will be made to the "Pod Picker" API.
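
A minimal sketch of the ranking step, assuming the controller already has the decoded Pod Picker response as two string slices (the helper below is illustrative, not the KEP's actual controller code):

package picker

import "sort"

// rankPodsForDeletion assigns each candidate Pod a rank from the Pod Picker
// response and returns the candidates sorted lowest-rank-first, i.e. the Pods
// at the front of the returned slice are deleted first.
func rankPodsForDeletion(candidates, chosenPods, tiedPods []string) []string {
	rank := make(map[string]int, len(candidates))
	for _, name := range candidates {
		rank[name] = 3 // all remaining Pods
	}
	for _, name := range tiedPods {
		rank[name] = 1 // returned in tied_pods
	}
	for _, name := range chosenPods {
		rank[name] = 0 // returned in chosen_pods
	}

	sorted := append([]string(nil), candidates...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return rank[sorted[i]] < rank[sorted[j]]
	})
	return sorted
}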

User Stories (from KEP-3189)

Story 1:

As a Data Platform Engineer, I want to run Apache Airflow on Kubernetes and autoscale the number of workers while killing the least active workers on downscale. This allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimizing how many worker-tasks are impacted by scaling.

Solution:

  • I can run my Airflow celery workers in a Deployment.
  • I can use a ScaledObject from KEDA to create a HorizontalPodAutoscaler that scales replicas based on current worker task load, using the PostgreSQL Scaler.
  • I can create a REST API (with Python) for downscalePodPicker that runs in a Deployment, and answers a request to choose N Pods (as sketched after this list) by:
    1. querying the Airflow Metadata DB to find how many tasks each worker is doing (weighting longer-running tasks higher)
    2. finding the N workers with the lowest weighting:
      • if multiple workers have the same weighting, return them as "tied pods"
      • if we find N workers doing nothing, we can exit early, and return those as "chosen pods"
    3. returning these lists of "chosen pods" and "tied pods"
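
The story calls for a Python API, but to stay consistent with the earlier Go sketches, here is the same decision loop as an illustrative Go handler; queryWorkerWeights is a hypothetical stand-in for the Airflow Metadata DB query, not a real Airflow API.

package picker

import (
	"encoding/json"
	"net/http"
	"sort"
)

// queryWorkerWeights stands in for a query against the Airflow Metadata DB:
// it returns a task-load weight per candidate worker Pod (0 means idle,
// longer-running tasks weighted higher). Hypothetical helper, not a real API.
func queryWorkerWeights(candidates []string) (map[string]float64, error) {
	// e.g. SELECT hostname, SUM(task_weight) FROM running_tasks GROUP BY hostname
	return map[string]float64{}, nil
}

func pickHandler(w http.ResponseWriter, r *http.Request) {
	var req struct {
		NumberOfPodsRequested int      `json:"number_of_pods_requested"`
		CandidatePods         []string `json:"candidate_pods"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	weights, err := queryWorkerWeights(req.CandidatePods)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Sort candidates by weight, least-loaded workers first.
	pods := append([]string(nil), req.CandidatePods...)
	sort.SliceStable(pods, func(i, j int) bool { return weights[pods[i]] < weights[pods[j]] })

	n := req.NumberOfPodsRequested
	if n > len(pods) {
		n = len(pods)
	}
	chosen, tied := []string{}, []string{}
	if n > 0 {
		// Pods strictly below the weight at the Nth position are definitely
		// chosen; Pods sharing that boundary weight are returned as ties.
		boundary := weights[pods[n-1]]
		for _, pod := range pods {
			switch {
			case weights[pod] < boundary:
				chosen = append(chosen, pod)
			case weights[pod] == boundary:
				tied = append(tied, pod)
			}
		}
		// No ambiguity at the boundary: promote the ties to "chosen".
		if len(chosen)+len(tied) == n {
			chosen, tied = append(chosen, tied...), []string{}
		}
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string][]string{
		"chosen_pods": chosen,
		"tied_pods":   tied,
	})
}

The boundary-weight split is what produces the "chosen pods" / "tied pods" behaviour in steps 2-3: workers strictly below the Nth-lowest weight are chosen outright, and workers sharing that weight are returned as ties. An implementation could also exit early inside queryWorkerWeights once N idle workers are found; that optimisation is omitted here for brevity.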

Story 2:

As a Data Engineer, I want to run an Apache Spark cluster and autoscale the number of workers while impacting the fewest running tasks on downscale. This allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimizing how many tasks are impacted by scaling.

Solution:

  • (Similar to Story 1)

Story 3:

As a Platform Engineer, I want to run a sharded Minecraft server on Kubernetes and autoscale the number of shards while impacting the fewest connected users on downscale. This allows me to save money by not under-utilizing Nodes, and to improve user-experience by minimizing the number of users impacted by scaling.

Solution:

  • I can run my Minecraft server shards in a Deployment.
  • I can use my in-house solution to control how many replicas the Deployment has.
  • I can create a REST API (with Java) for downscalePodPicker that runs in a Deployment, and answers a request to choose N Pods by:
    1. keeping an in-memory cache of how many users are on each shard (weighting "premium" users higher)
    2. finding the N shards with the lowest user-load:
      • if multiple shards have the same weighting, return them as "tied pods"
      • if we find N empty shards, we can exit early, and return those as "chosen pods"
    3. returning these lists of "chosen pods" and "tied pods"

Story 4:

As a Site Reliability Engineer, I want to ensure my NodeJS website maintains regional distribution when downscaling the number of replicas. This allows me to ensure uptime when a region experiences an outage.

Solution:

  • I can run my NodeJS application in a Deployment.
  • I can use a HorizontalPodAutoscaler with CPU metrics to control how many replicas the Deployment has.
  • I can create a REST API (with TypeScript) for downscalePodPicker that runs in a Deployment, and answers a request to choose N Pods by:
    1. keeping track of which region each Pod is in
    2. finding N Pods that we can remove without violating the regional distribution requirements:
      • if there are multiple acceptable options, we could choose the N Pods which are doing the least work
    3. returning these Pods as the "chosen pods"

Alternatives (from KEP-3189)

(REJECTED) Pod-Deletion-Cost annotation/status approach:

  • DESCRIPTION: an annotation/status field is created on Pods that contains their current pod-deletion-cost
  • OPTION 1: the annotation/status is always up-to-date
    • PROBLEM 1: this will not scale if costs change quickly
      • every update requires a PATCH call to the kube-apiserver
    • PROBLEM 2: this will not scale for large numbers of Pods
      • every update requires a PATCH call to the kube-apiserver
    • PROBLEM 3: this is wasteful for deployments that rarely scale down
      • calculating and updating the cost could be expensive, and that work is wasted if not actually used to downscale
  • OPTION 2: the annotation/status is updated only when downscaling
    • PROBLEM 1: existing scaling tools like HorizontalPodAutoscaler can't be used
      • the annotation/status would need to be updated BEFORE downscaling, and we can't predict when HorizontalPodAutoscaler will downscale
      • therefore, users must write their own scalers that update the annotation/status before downscaling (making it inaccessible for most users)
    • PROBLEM 2: this will not scale if apps frequently downscale
      • after each downscale, any annotations added by the scaler will need to be cleared (they will become out of date)
      • we must wait to clear them before we can start the next downscale
    • PROBLEM 3: costs may be out-of-date by the time the controller picks which pod to downscale
      • consider apps like airflow/spark where workers may accept new tasks on a second-to-second basis
      • if updating the annotation/status takes too long, the cost annotation may be outdated, defeating the point

(REJECTED) Pod-Deletion-Cost http/exec probe approach:

  • DESCRIPTION: a new http/exec probe is created for Pods which returns their current pod-deletion-cost
  • OPTION 1: probes are used every X seconds to update a Pod status field
    • (Suffers from the same problems as "OPTION 1" for the annotation/status approach)
  • OPTION 2: probes are ONLY used when downscaling
    • PROBLEM 1: the controller cannot make a probe request to pods (this must be done by the Node's kubelet)
      • to solve this you would need a complex system that has the controller "mark" the pods for probing (possibly by creating an event), and then waits for some status to be updated
    • PROBLEM 2: this will not scale for large numbers of Pods
      • probes will take time to run
      • to solve this, you would need to use a heuristic approach, e.g. only checking a sample of Pods and returning the lowest cost from the sample

(REJECTED) Pod-Deletion-Cost API approach:

  • DESCRIPTION: a central user-managed API is queried by the controller and returns the pod-deletion-cost of each Pod
  • PROBLEM 1: the user-managed API can't exit early
    • because a pod-deletion-cost must be returned for each Pod, the API can't have the concept of a "free" Pod, which if found can be immediately returned without checking the other Pods
  • PROBLEM 2: the user-managed API may calculate the pod-deletion-cost of more Pods than necessary
    • the API is unaware of how many Pods are actually planned to be removed, so will calculate the pod-deletion-cost of more Pods than is necessary

(ACCEPTED) Pod-Picker API approach:

  • DESCRIPTION: a central user-managed API is queried by the controller and chooses the N best Pods to kill from a list
  • BENEFIT 1: users can implement any system they would like for deciding which pods to pick
    • e.g. they could incorporate http/exec probes into their Pod-Picker API
    • e.g. they could incorporate node information, like geographic distribution or VM cost
  • BENEFIT 2: existing resources like HorizontalPodAutoscaler can be used out-of-the-box
  • BENEFIT 3: many apps will already have a central system that tracks load on each "shard" or "worker"
    • those systems can be extended to include a Pod-Picker API
