Description
Background
I recently created KEP-3189, which proposes a `downscalePodPicker` field for `ReplicaSetSpec` and `DeploymentSpec`. This field specifies a user-created "Pod Picker" REST API that informs which Pods are removed when `replicas` is decreased.

As you might imagine, getting anything added to upstream K8S for workloads that aren't simple web APIs is like pulling teeth. I am hopeful that KEP-3189 succeeds upstream, but I think OpenKruise will be more receptive and understand the benefits of this feature.

Therefore, I ask that the OpenKruise project consider a "Downscale Pod Picker" API for the `CloneSet` resource.

For full details about KEP-3189, see PR kubernetes/enhancements#3190.
Technical Details (from KEP-3189)
The gist of KEP-3189 is that a new `downscalePodPicker` field will be added to `ReplicaSetSpec` and `DeploymentSpec`. This field specifies a user-created "Pod Picker" REST API that informs which Pods are removed when `replicas` is decreased.
K8S API Changes:
Here is an example ReplicaSet with the new `downscalePodPicker` field:
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image
  #############################
  ## this is the new section ##
  #############################
  downscalePodPicker:
    http:
      host: my-app.my-namespace.svc.cluster.local
      httpHeaders:
        - name: authentication
          valueFrom:
            secretKeyRef:
              key: authentication-token
              name: my-secret
      path: "/downscale-pod-picker"
      port: 443
      scheme: "https"
    maxRetries: 3
    timeoutSeconds: 5
```
"Pod Picker" API contract:
The contract for a `downscalePodPicker` REST API will be as follows (a minimal example implementation is sketched after the notes below):

- Request payloads to the API will contain:
  - `number_of_pods_requested` (`int`): minimum number of Pods to return
  - `candidate_pods` (`list[string]`): list of Pod-names to choose from
- Response payloads from the API will contain:
  - `chosen_pods` (`list[string]`): list of Pod-names chosen to be removed
  - `tied_pods` (`list[string]`): list of Pod-names we can't decide between
- Other requirements:
  - both `chosen_pods` and `tied_pods` can be non-empty
  - the total number of Pods returned must be AT LEAST `number_of_pods_requested` (more if there are ties)
  - only Pod-names contained in `candidate_pods` may be returned
- NOTES:
  - The response payload is split into two lists because, when `number_of_pods_requested > 1`, it is possible to have some Pods which were definitely chosen, and others which we cannot decide between but which are definitely the "next best" after the chosen Pods. (e.g. if the Pods have the following metrics `[1, 2, 2, 2]` and `number_of_pods_requested = 2`, we might return `chosen_pods = [pod-1]`, `tied_pods = [pod-2, pod-3, pod-4]`)
  - This contract doesn't require that `downscalePodPicker` APIs make a decision about ALL `candidate_pods`, and allows them to be designed such that they exit early from their search if they find enough good candidates before considering all Pods. (e.g. if the API is looking for the least-active Pods, it can exit early if enough fully-idle Pods are found to meet `number_of_pods_requested`)
  - The controller will exclude Pods in Unassigned/PodPending/PodUnknown/Unready states from `candidate_pods` (however, the state of any Pod may change in the time it takes for the request to be processed).
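To make the request/response contract concrete, here is a minimal sketch of a "Pod Picker" service. Python with FastAPI is an assumption made purely for illustration (the contract is framework-agnostic), and the model names and trivial selection strategy are hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request/response shapes taken from the contract above.
class PickRequest(BaseModel):
    number_of_pods_requested: int
    candidate_pods: list[str]

class PickResponse(BaseModel):
    chosen_pods: list[str]
    tied_pods: list[str]

@app.post("/downscale-pod-picker", response_model=PickResponse)
def pick_pods(req: PickRequest) -> PickResponse:
    # A real implementation would rank candidates by application-specific
    # load; this placeholder simply returns the first N candidate Pods
    # and reports no ties.
    n = min(req.number_of_pods_requested, len(req.candidate_pods))
    return PickResponse(chosen_pods=req.candidate_pods[:n], tied_pods=[])
```

On the wire this would correspond to JSON bodies like `{"number_of_pods_requested": 2, "candidate_pods": ["pod-a", "pod-b", "pod-c"]}` and `{"chosen_pods": ["pod-a"], "tied_pods": ["pod-b", "pod-c"]}`, assuming a JSON encoding (which the KEP would need to pin down).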
Controller Behaviour Changes:
On downscale, the `ReplicaSet` controller assigns all Pods a "rank" based on the response from the "Pod Picker", with lower-value ranks being killed first (until enough Pods have been removed):

- Returned `chosen_pods` have `rank = 0`
- Returned `tied_pods` have `rank = 1`
- All remaining Pods have `rank = 3`

NOTE: Unassigned/PodPending/PodUnknown/Unready Pods will always be killed first; if there are enough of them to fulfill the downscale, no calls will be made to the "Pod Picker" API.
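To illustrate the ranking behaviour described above, here is a rough sketch in Python (the real controller would be written in Go inside the ReplicaSet/CloneSet controller; the function and variable names here are illustrative, not part of the KEP):

```python
def order_pods_for_deletion(candidate_pods: list[str],
                            chosen_pods: list[str],
                            tied_pods: list[str]) -> list[str]:
    """Order candidate Pods for deletion based on the Pod Picker response.

    Lower ranks are killed first; the exact rank values only matter for
    their relative ordering.
    """
    def rank(pod: str) -> int:
        if pod in chosen_pods:
            return 0   # definitely picked by the Pod Picker
        if pod in tied_pods:
            return 1   # "next best" Pods the picker could not split
        return 3       # all remaining Pods are killed last

    # Within the same rank, the controller's existing deletion ordering
    # (e.g. pod-deletion-cost, then creation time) would break ties.
    return sorted(candidate_pods, key=rank)
```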
User Stories (from KEP-3189)
Story 1:
As a Data Platform Engineer, I want to run Apache Airflow on Kubernetes and autoscale the number of workers while killing the least active workers on downscale; this allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimizing how many worker-tasks are impacted by scaling.
Solution:
- I can run my Airflow Celery workers in a Deployment.
- I can use a ScaledObject from KEDA to create a HorizontalPodAutoscaler that scales `replicas` based on current worker task load, using the PostgreSQL Scaler.
- I can create a REST API (with Python) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods (see the sketch after this list) by:
  - querying the Airflow Metadata DB to find how many tasks each worker is doing (weighting longer-running tasks higher)
  - finding the `N` workers with the lowest weighting:
    - if multiple workers have the same weighting, return them as "tied pods"
    - if we find `N` workers doing nothing, we can exit early, and return those as "chosen pods"
  - returning these lists of "chosen pods" and "tied pods"
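As a rough illustration of that selection logic (in Python), assuming a hypothetical helper has already queried the Airflow metadata DB and produced a task weight per worker Pod:

```python
def pick_airflow_workers(number_of_pods_requested: int,
                         candidate_pods: list[str],
                         task_weight_by_pod: dict[str, float]) -> dict:
    """Pick the least-loaded Airflow workers to remove on downscale.

    `task_weight_by_pod` is assumed to come from a query against the
    Airflow metadata DB (helper not shown); Pods with no entry are idle.
    """
    if not candidate_pods:
        return {"chosen_pods": [], "tied_pods": []}
    weights = {pod: task_weight_by_pod.get(pod, 0.0) for pod in candidate_pods}

    # Exit early: enough fully-idle workers means no further ranking is needed.
    idle = [pod for pod in candidate_pods if weights[pod] == 0.0]
    if len(idle) >= number_of_pods_requested:
        return {"chosen_pods": idle[:number_of_pods_requested], "tied_pods": []}

    # Otherwise sort by weight; workers sharing the boundary weight are "tied",
    # so at least `number_of_pods_requested` Pods are always returned.
    ordered = sorted(candidate_pods, key=lambda pod: weights[pod])
    n = min(number_of_pods_requested, len(ordered))
    boundary = weights[ordered[n - 1]]
    chosen = [pod for pod in ordered if weights[pod] < boundary]
    tied = [pod for pod in ordered if weights[pod] == boundary]
    return {"chosen_pods": chosen, "tied_pods": tied}
```

The exit-early branch is the property the contract notes call out: an idle worker can be returned immediately without weighing every candidate.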
Story 2:
As a Data Engineer, I want to run an Apache Spark cluster and autoscale the number of workers while impacting the fewest running tasks on downscale; this allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimising how many tasks are impacted by scaling.
Solution:
- (Similar to Story 1)
Story 3:
As a Platform Engineer, I want to run a sharded Minecraft server on Kubernetes and autoscale the number of shards while impacting the fewest connected users on downscale; this allows me to save money by not under-utilizing Nodes, and to improve user-experience by minimising the number of users impacted by scaling.
Solution:
- I can run my Minecraft server shards in a Deployment.
- I can use my in-house solution to control how many `replicas` the Deployment has.
- I can create a REST API (with Java) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods by:
  - keeping an in-memory cache of how many users are on each shard (weighting "premium" users higher)
  - finding the `N` shards with the lowest user-load:
    - if multiple shards have the same weighting, return them as "tied pods"
    - if we find `N` empty shards, we can exit early, and return those as "chosen pods"
  - returning these lists of "chosen pods" and "tied pods"
Story 4:
As a Site Reliability Engineer, I want to ensure my NodeJS website maintains regional distribution when downscaling the number of replicas; this allows me to ensure uptime when a region experiences an outage.
Solution:
- I can run my NodeJS application in a Deployment.
- I can use a HorizontalPodAutoscaler with CPU metrics to control how many `replicas` the Deployment has.
- I can create a REST API (with TypeScript) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods (see the sketch after this list) by:
  - keeping track of which region each Pod is in
  - finding `N` Pods that we can remove without violating the regional distribution requirements:
    - if there are multiple acceptable options, we could choose the `N` Pods which are doing the least work
  - returning these Pods as the "chosen pods"
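The story proposes TypeScript; purely for consistency with the other sketches, here is one possible interpretation of that logic in Python, with a hypothetical `region_by_pod` map the Pod Picker service is assumed to maintain (e.g. from Node topology labels):

```python
from collections import defaultdict

def pick_pods_keeping_regions_balanced(number_of_pods_requested: int,
                                       candidate_pods: list[str],
                                       region_by_pod: dict[str, str]) -> dict:
    """Greedy sketch: repeatedly remove a Pod from whichever region currently
    has the most remaining replicas, so the survivors stay evenly spread.

    Ignores the per-Pod load tiebreak mentioned in the story, for brevity.
    """
    pods_by_region = defaultdict(list)
    for pod in candidate_pods:
        pods_by_region[region_by_pod.get(pod, "unknown")].append(pod)

    chosen = []
    for _ in range(min(number_of_pods_requested, len(candidate_pods))):
        # Pick the region that still has the most candidate Pods left.
        biggest = max(pods_by_region, key=lambda region: len(pods_by_region[region]))
        chosen.append(pods_by_region[biggest].pop())

    # This strategy makes a definite decision, so no "tied" Pods are reported.
    return {"chosen_pods": chosen, "tied_pods": []}
```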
Alternatives (from KEP-3189)
(REJECTED) Pod-Deletion-Cost annotation/status approach:
- DESCRIPTION: an annotation/status field is created on Pods that contains their current pod-deletion-cost
- OPTION 1: the annotation/status is always kept up-to-date
  - PROBLEM 1: this will not scale if costs change quickly
    - every update requires a PATCH call to the kube-apiserver
  - PROBLEM 2: this will not scale for large numbers of Pods
    - every update requires a PATCH call to the kube-apiserver
  - PROBLEM 3: this is wasteful for deployments that rarely scale down
    - calculating and updating the cost could be expensive, and that work is wasted if it is not actually used to downscale
- OPTION 2: the annotation/status is updated only when downscaling
  - PROBLEM 1: existing scaling tools like HorizontalPodAutoscaler can't be used
    - the annotation/status would need to be updated BEFORE downscaling, and we can't predict when HorizontalPodAutoscaler will downscale
    - therefore, users must write their own scalers that update the annotation/status before downscaling (making it inaccessible for most users)
  - PROBLEM 2: this will not scale if apps frequently downscale
    - after each downscale, any annotations added by the scaler will need to be cleared (they will become out of date)
    - we must wait to clear them before we can start the next downscale
  - PROBLEM 3: costs may be out-of-date by the time the controller picks which Pod to downscale
    - consider apps like Airflow/Spark where workers may accept new tasks on a second-to-second basis
    - if updating the annotation/status takes too long, the cost annotation may be outdated, defeating the point
(REJECTED) Pod-Deletion-Cost http/exec probe approach:
- DESCRIPTION: a new http/exec probe is created for Pods which returns their current pod-deletion-cost
- OPTION 1: probes are used every X seconds to update a Pod status field
  - (Suffers from the same problems as "OPTION 1" for the annotation/status approach)
- OPTION 2: probes are ONLY used when downscaling
  - PROBLEM 1: the controller cannot make a probe request to Pods (this must be done by the Node's kubelet)
    - to solve this, you would need a complex system that has the controller "mark" the Pods for probing (possibly by creating an event), and then waits for some status to be updated
  - PROBLEM 2: this will not scale for large numbers of Pods
    - probes will take time to run
    - to solve this, you would need to use a heuristic approach, e.g. only checking a sample of Pods and returning the lowest cost from the sample
(REJECTED) Pod-Deletion-Cost API approach:
- DESCRIPTION: a central user-managed API is queried by the controller and returns the pod-deletion-cost of each Pod
- PROBLEM 1: the user-managed API can't exit early
  - because a pod-deletion-cost must be returned for every Pod, the API can't have the concept of a "free" Pod which, if found, can be immediately returned without checking the other Pods
- PROBLEM 2: the user-managed API may calculate the pod-deletion-cost of more Pods than necessary
  - the API is unaware of how many Pods are actually planned to be removed, so it will calculate the pod-deletion-cost of more Pods than is necessary
(ACCEPTED) Pod-Picker API approach:
- DESCRIPTION: a central user-managed API is queried by the controller and chooses the N best Pods to kill from a list
- BENEFIT 1: users can implement any system they would like for deciding which Pods to pick
  - e.g. they could incorporate http/exec probes into their Pod-Picker API
  - e.g. they could incorporate Node information, like geographic distribution or VM cost
- BENEFIT 2: existing resources like HorizontalPodAutoscaler can be used out-of-the-box
- BENEFIT 3: many apps will already have a central system that tracks load on each "shard" or "worker"
  - those systems can be extended to include a Pod-Picker API