Description
Background
I recently created KEP-3189, which proposes a `downscalePodPicker` field for `ReplicaSetSpec` and `DeploymentSpec`. This field specifies a user-created "Pod Picker" REST API that informs which Pods are removed when `replicas` is decreased.

As you might imagine, getting anything added to upstream K8S for workloads that aren't simple web APIs is like pulling teeth. I am hopeful that KEP-3189 succeeds upstream, but I think OpenKruise will be more receptive and understand the benefits of this feature.

Therefore, I ask that the OpenKruise project consider a "Downscale Pod Picker" API for the `CloneSet` resource.

For full details about KEP-3189, see PR kubernetes/enhancements#3190.
Technical Details (from KEP-3189)
The gist of KEP-3189 is that a new `downscalePodPicker` field will be added to `ReplicaSetSpec` and `DeploymentSpec`. This field specifies a user-created "Pod Picker" REST API that informs which Pods are removed when `replicas` is decreased.
K8S API Changes:
Here is an example ReplicaSet with the new `downscalePodPicker` field:
```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-replicaset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: my-image
  #############################
  ## this is the new section ##
  #############################
  downscalePodPicker:
    http:
      host: my-app.my-namespace.svc.cluster.local
      httpHeaders:
        - name: authentication
          valueFrom:
            secretKeyRef:
              key: authentication-token
              name: my-secret
      path: "/downscale-pod-picker"
      port: 443
      scheme: "https"
    maxRetries: 3
    timeoutSeconds: 5
```
"Pod Picker" API contract:
The contract for a `downscalePodPicker` REST API will be as follows (a minimal example implementation is sketched after the notes below):

- Request payloads to the API will contain:
  - `number_of_pods_requested` (`int`): minimum number of Pods to return
  - `candidate_pods` (`list[string]`): list of Pod-names to choose from
- Response payloads from the API will contain:
  - `chosen_pods` (`list[string]`): list of Pod-names chosen to be removed
  - `tied_pods` (`list[string]`): list of Pod-names we can't decide between
- Other requirements:
  - both `chosen_pods` and `tied_pods` can be non-empty
  - the total number of Pods returned must be AT LEAST `number_of_pods_requested` (more if there are ties)
  - only Pod-names contained in `candidate_pods` may be returned
- NOTES:
  - The response payload is split into two lists because, when `number_of_pods_requested > 1`, it is possible to have some Pods which were definitely chosen, and others which we cannot decide between but which are definitely the "next best" after the chosen Pods. (e.g. if the Pods have the following metrics `[1, 2, 2, 2]` and `number_of_pods_requested = 2`, we might return `chosen_pods = [pod-1]`, `tied_pods = [pod-2, pod-3, pod-4]`)
  - This contract doesn't require that `downscalePodPicker` APIs make a decision about ALL `candidate_pods`, and allows them to be designed such that they exit early from their search if they find enough good candidates before considering all Pods. (e.g. if the API is looking for the least-active Pods, it can exit early if enough fully-idle Pods are found to meet `number_of_pods_requested`)
  - The controller will exclude Pods in Unassigned/PodPending/PodUnknown/Unready states from `candidate_pods` (however, the state of any Pod may change in the time it takes for the request to be processed).
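To make the request/response contract concrete, here is a minimal sketch of a "Pod Picker" service. Python with FastAPI is an assumption made purely for illustration (the contract is framework-agnostic), and the model names and trivial selection strategy are hypothetical:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Request/response shapes taken from the contract above.
class PickRequest(BaseModel):
    number_of_pods_requested: int
    candidate_pods: list[str]

class PickResponse(BaseModel):
    chosen_pods: list[str]
    tied_pods: list[str]

@app.post("/downscale-pod-picker", response_model=PickResponse)
def pick_pods(req: PickRequest) -> PickResponse:
    # A real implementation would rank candidates by application-specific
    # load; this placeholder simply returns the first N candidate Pods
    # and reports no ties.
    n = min(req.number_of_pods_requested, len(req.candidate_pods))
    return PickResponse(chosen_pods=req.candidate_pods[:n], tied_pods=[])
```

On the wire this would correspond to JSON bodies like `{"number_of_pods_requested": 2, "candidate_pods": ["pod-a", "pod-b", "pod-c"]}` and `{"chosen_pods": ["pod-a"], "tied_pods": ["pod-b", "pod-c"]}`, assuming a JSON encoding (which the KEP would need to pin down).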
Controller Behaviour Changes:
On downscale, the `ReplicaSet` controller assigns all Pods a "rank" based on the response from the "Pod Picker", with lower-value ranks being killed first (until enough Pods have been removed):

- Returned `chosen_pods` have `rank = 0`
- Returned `tied_pods` have `rank = 1`
- All remaining Pods have `rank = 3`

NOTE: Unassigned/PodPending/PodUnknown/Unready Pods will always be killed first; if there are enough of them to fulfill the downscale, no calls will be made to the "Pod Picker" API.
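To illustrate the ranking behaviour described above, here is a rough sketch in Python (the real controller would be written in Go inside the ReplicaSet/CloneSet controller; the function and variable names here are illustrative, not part of the KEP):

```python
def order_pods_for_deletion(candidate_pods: list[str],
                            chosen_pods: list[str],
                            tied_pods: list[str]) -> list[str]:
    """Order candidate Pods for deletion based on the Pod Picker response.

    Lower ranks are killed first; the exact rank values only matter for
    their relative ordering.
    """
    def rank(pod: str) -> int:
        if pod in chosen_pods:
            return 0   # definitely picked by the Pod Picker
        if pod in tied_pods:
            return 1   # "next best" Pods the picker could not split
        return 3       # all remaining Pods are killed last

    # Within the same rank, the controller's existing deletion ordering
    # (e.g. pod-deletion-cost, then creation time) would break ties.
    return sorted(candidate_pods, key=rank)
```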
User Stories (from KEP-3189)
Story 1:
As a Data Platform Engineer, I want to run Apache Airflow on Kubernetes and autoscale the number of workers while killing the least active workers on downscale; this allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimizing how many worker-tasks are impacted by scaling.
Solution:
- I can run my Airflow Celery workers in a Deployment.
- I can use a ScaledObject from KEDA to create a HorizontalPodAutoscaler that scales `replicas` based on current worker task load, using the PostgreSQL Scaler.
- I can create a REST API (with Python) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods (see the sketch after this list) by:
  - querying the Airflow Metadata DB to find how many tasks each worker is doing (weighting longer-running tasks higher)
  - finding the `N` workers with the lowest weighting:
    - if multiple workers have the same weighting, return them as "tied pods"
    - if we find `N` workers doing nothing, we can exit early, and return those as "chosen pods"
  - returning these lists of "chosen pods" and "tied pods"
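As a rough illustration of that selection logic (in Python), assuming a hypothetical helper has already queried the Airflow metadata DB and produced a task weight per worker Pod:

```python
def pick_airflow_workers(number_of_pods_requested: int,
                         candidate_pods: list[str],
                         task_weight_by_pod: dict[str, float]) -> dict:
    """Pick the least-loaded Airflow workers to remove on downscale.

    `task_weight_by_pod` is assumed to come from a query against the
    Airflow metadata DB (helper not shown); Pods with no entry are idle.
    """
    if not candidate_pods:
        return {"chosen_pods": [], "tied_pods": []}
    weights = {pod: task_weight_by_pod.get(pod, 0.0) for pod in candidate_pods}

    # Exit early: enough fully-idle workers means no further ranking is needed.
    idle = [pod for pod in candidate_pods if weights[pod] == 0.0]
    if len(idle) >= number_of_pods_requested:
        return {"chosen_pods": idle[:number_of_pods_requested], "tied_pods": []}

    # Otherwise sort by weight; workers sharing the boundary weight are "tied",
    # so at least `number_of_pods_requested` Pods are always returned.
    ordered = sorted(candidate_pods, key=lambda pod: weights[pod])
    n = min(number_of_pods_requested, len(ordered))
    boundary = weights[ordered[n - 1]]
    chosen = [pod for pod in ordered if weights[pod] < boundary]
    tied = [pod for pod in ordered if weights[pod] == boundary]
    return {"chosen_pods": chosen, "tied_pods": tied}
```

The exit-early branch is the property the contract notes call out: an idle worker can be returned immediately without weighing every candidate.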
Story 2:
As a Data Engineer, I want to run an Apache Spark cluster and autoscale the number of workers while impacting the fewest running tasks on downscale; this allows me to save money by not under-utilizing Nodes, and to reduce wasted time by minimising how many tasks are impacted by scaling.
Solution:
- (Similar to Story 1)
Story 3:
As a Platform Engineer, I want to run a sharded Minecraft server on Kubernetes and autoscale the number of shards while impacting the fewest connected users on downscale; this allows me to save money by not under-utilizing Nodes, and to improve user-experience by minimising the number of users impacted by scaling.
Solution:
- I can run my Minecraft server shards in a Deployment.
- I can use my in-house solution to control how many `replicas` the Deployment has.
- I can create a REST API (with Java) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods by:
  - keeping an in-memory cache of how many users are on each shard (weighting "premium" users higher)
  - finding the `N` shards with the lowest user-load:
    - if multiple shards have the same weighting, return them as "tied pods"
    - if we find `N` empty shards, we can exit early, and return those as "chosen pods"
  - returning these lists of "chosen pods" and "tied pods"
Story 4:
As a Site Reliability Engineer, I want to ensure my NodeJS website maintains regional distribution when downscaling the number of replicas; this allows me to ensure uptime when a region experiences an outage.
Solution:
- I can run my NodeJS application in a Deployment.
- I can use a HorizontalPodAutoscaler with CPU metrics to control how many `replicas` the Deployment has.
- I can create a REST API (with TypeScript) for `downscalePodPicker` that runs in a Deployment, and answers a request to choose `N` Pods (see the sketch after this list) by:
  - keeping track of which region each Pod is in
  - finding `N` Pods that we can remove without violating the regional distribution requirements:
    - if there are multiple acceptable options, we could choose the `N` Pods which are doing the least work
  - returning these Pods as the "chosen pods"
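The story proposes TypeScript; purely for consistency with the other sketches, here is one possible interpretation of that logic in Python, with a hypothetical `region_by_pod` map the Pod Picker service is assumed to maintain (e.g. from Node topology labels):

```python
from collections import defaultdict

def pick_pods_keeping_regions_balanced(number_of_pods_requested: int,
                                       candidate_pods: list[str],
                                       region_by_pod: dict[str, str]) -> dict:
    """Greedy sketch: repeatedly remove a Pod from whichever region currently
    has the most remaining replicas, so the survivors stay evenly spread.

    Ignores the per-Pod load tiebreak mentioned in the story, for brevity.
    """
    pods_by_region = defaultdict(list)
    for pod in candidate_pods:
        pods_by_region[region_by_pod.get(pod, "unknown")].append(pod)

    chosen = []
    for _ in range(min(number_of_pods_requested, len(candidate_pods))):
        # Pick the region that still has the most candidate Pods left.
        biggest = max(pods_by_region, key=lambda region: len(pods_by_region[region]))
        chosen.append(pods_by_region[biggest].pop())

    # This strategy makes a definite decision, so no "tied" Pods are reported.
    return {"chosen_pods": chosen, "tied_pods": []}
```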
Alternatives (from KEP-3189)
(REJECTED) Pod-Deletion-Cost annotation/status approach:
- DESCRIPTION: an annotation/status field is created on Pods that contains their current pod-deletion-cost
- OPTION 1: the annotation/status is always kept up-to-date
  - PROBLEM 1: this will not scale if costs change quickly
    - every update requires a PATCH call to the kube-apiserver
  - PROBLEM 2: this will not scale for large numbers of Pods
    - every update requires a PATCH call to the kube-apiserver
  - PROBLEM 3: this is wasteful for deployments that rarely scale down
    - calculating and updating the cost could be expensive, and that work is wasted if it is not actually used to downscale
- OPTION 2: the annotation/status is updated only when downscaling
  - PROBLEM 1: existing scaling tools like HorizontalPodAutoscaler can't be used
    - the annotation/status would need to be updated BEFORE downscaling, and we can't predict when HorizontalPodAutoscaler will downscale
    - therefore, users must write their own scalers that update the annotation/status before downscaling (making it inaccessible for most users)
  - PROBLEM 2: this will not scale if apps frequently downscale
    - after each downscale, any annotations added by the scaler will need to be cleared (they will become out of date)
    - we must wait to clear them before we can start the next downscale
  - PROBLEM 3: costs may be out-of-date by the time the controller picks which Pod to downscale
    - consider apps like Airflow/Spark where workers may accept new tasks on a second-to-second basis
    - if updating the annotation/status takes too long, the cost annotation may be outdated, defeating the point
(REJECTED) Pod-Deletion-Cost http/exec probe approach:
- DESCRIPTION: a new http/exec probe is created for Pods which returns their current pod-deletion-cost
- OPTION 1: probes are used every X seconds to update a Pod status field
  - (Suffers from the same problems as "OPTION 1" for the annotation/status approach)
- OPTION 2: probes are ONLY used when downscaling
  - PROBLEM 1: the controller cannot make a probe request to Pods (this must be done by the Node's kubelet)
    - to solve this, you would need a complex system that has the controller "mark" the Pods for probing (possibly by creating an event), and then waits for some status to be updated
  - PROBLEM 2: this will not scale for large numbers of Pods
    - probes will take time to run
    - to solve this, you would need to use a heuristic approach, e.g. only checking a sample of Pods and returning the lowest cost from the sample
(REJECTED) Pod-Deletion-Cost API approach:
- DESCRIPTION: a central user-managed API is queried by the controller and returns the pod-deletion-cost of each Pod
- PROBLEM 1: the user-managed API can't exit early
  - because a pod-deletion-cost must be returned for every Pod, the API can't have the concept of a "free" Pod which, if found, can be immediately returned without checking the other Pods
- PROBLEM 2: the user-managed API may calculate the pod-deletion-cost of more Pods than necessary
  - the API is unaware of how many Pods are actually planned to be removed, so it will calculate the pod-deletion-cost of more Pods than is necessary
(ACCEPTED) Pod-Picker API approach:
- DESCRIPTION: a central user-managed API is queried by the controller and chooses the N best Pods to kill from a list
- BENEFIT 1: users can implement any system they would like for deciding which Pods to pick
  - e.g. they could incorporate http/exec probes into their Pod-Picker API
  - e.g. they could incorporate Node information, like geographic distribution or VM cost
- BENEFIT 2: existing resources like HorizontalPodAutoscaler can be used out-of-the-box
- BENEFIT 3: many apps will already have a central system that tracks load on each "shard" or "worker"
  - those systems can be extended to include a Pod-Picker API