Tutorial - katalyst colocation best-practices

This guide introduces best practices for colocation in katalyst with an end-to-end example. Follow the steps below to get a quick look at the integrated functionality, and then replace the sample YAMLs with your own workloads when applying the system in your production environment.

Prerequisite

Please make sure you have deployed all prerequisite components before moving on to the next step.

  • Install the enhanced Kubernetes based on install-enhanced-k8s.md
  • Install components according to the instructions in Charts. To enable the full colocation functionality, the following components are required, while the others are optional.
    • Agent
    • Controller
    • Scheduler

Functionalities

Before going further, let's assume the following settings and configurations as the baseline:

  • Total resources are set to 48 cores and 195924424Ki of memory per node;
  • Resources reserved for pods with shared_cores are set to 4 cores and 5Gi, which means we always keep at least this amount of resources available for those pods' bursting requirements.

Based on the assumptions above, you can follow the steps below to dive into the colocation workflow.

Resource Reporting

After installation, the resource reporting module reports reclaimed resources. Since no pods are running yet, the reclaimed resources are calculated as: reclaimed_resources = total_resources - reserve_resources
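You can verify the baseline numbers with a quick bit of shell arithmetic (using the 48-core / 195924424Ki totals and the 4-core / 5Gi reservation assumed above):

$ echo $(( (48 - 4) * 1000 ))                             # 44000 millicpu, reported as 44k
$ echo $(( 195924424 * 1024 - 5 * 1024 * 1024 * 1024 ))   # 195257901056 bytes of reclaimed memory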

If you check the CNR (CustomNodeResource), the reclaimed resources will appear as follows, meaning that pods with reclaimed_cores can be scheduled onto this node.

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
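To pull these values from your own cluster, look up the CustomNodeResource resource name first and then query it (a sketch only; the exact resource or short name may differ between releases):

$ kubectl api-resources | grep -i customnoderesource      # find the exact resource / short name
$ kubectl get <cnr-resource-name> node-1 -o yaml | grep reclaimed_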

Submit several pods with shared_cores, and put pressure on those workloads so that the reclaimed resources fluctuate with the workloads' running state.

$ kubectl create -f ./examples/shared-normal-pod.yaml

Once scheduled, the pod runs with cpu usage ≈ 1 core and cpu load ≈ 1, and the reclaimed resources change according to the formula below. We skip memory here since it is harder to reproduce with exact values than cpu, but the principle is similar. reclaimed_cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k
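A rough cross-check of this value (hedged: the exact rounding is internal to the agent, but rounding the shared pool up to 6 cores matches both this output and the 6-core shared pool shown in the QoS Controlling section below):

$ # reserve(4) + max(usage ~1, load.1min ~1, load.5min ~1) = 5, rounded up to a 6-core shared pool
$ echo $(( (48 - 6) * 1000 ))   # 42000 millicpu, reported as 42k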

You can then put pressure on those pods with stress to simulate request peaks; the cpu load will rise to approximately 3, shrinking the reclaimed cpu to 40k.

$ kubectl exec shared-normal-pod -it -- stress -c 2
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k

Scheduling Strategy

Katalyst provides several strategies for scheduling pods with reclaimed_cores. You can alter the default scheduling config and then create a deployment with reclaimed_cores.

$ kubectl create -f ./examples/reclaimed-deployment.yaml

Spread

Spread is the default scheduling strategy. It tries to spread pods across all suitable nodes and is usually used to balance workload contention in the cluster. Apply the Spread policy with the command below, and pods will be scheduled evenly across the nodes.

$ kubectl apply -f ./examples/scheduler-policy-spread.yaml
$ kubectl get po -owide
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-4lknl   1/1     Running   0          3m31s   192.168.1.169   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-656bz   1/1     Running   0          3m31s   192.168.2.103   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-89n46   1/1     Running   0          3m31s   192.168.0.129   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-bcpbs   1/1     Running   0          3m31s   192.168.1.171   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-bq22q   1/1     Running   0          3m31s   192.168.0.126   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-jblgk   1/1     Running   0          3m31s   192.168.0.128   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-kxqdl   1/1     Running   0          3m31s   192.168.0.127   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-mdh2d   1/1     Running   0          3m31s   192.168.1.170   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-p2q7s   1/1     Running   0          3m31s   192.168.2.104   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-x7lqh   1/1     Running   0          3m31s   192.168.2.102   node-2   <none>           <none>
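To confirm the distribution at a glance, you can count pods per node with standard kubectl and awk (the NODE column is the 7th field of kubectl get po -owide --no-headers):

$ kubectl get po -owide --no-headers | awk '{print $7}' | sort | uniq -c
      3 node-1
      3 node-2
      4 node-3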

Binpack

Binpack tries to pack pods onto a single node until that node cannot accommodate more. It is usually used to squeeze workloads onto fewer nodes to raise utilization. Apply the Binpack policy with the command below, and most pods will be scheduled onto one densely packed node.

$ kubectl apply -f ./examples/scheduler-policy-binpack.yaml
$ kubectl get po -owide      
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-7mjbz   1/1     Running   0          36s   192.168.1.176   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-h8nmk   1/1     Running   0          36s   192.168.1.177   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-hfhqt   1/1     Running   0          36s   192.168.1.181   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-nhx4h   1/1     Running   0          36s   192.168.1.182   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-s8sx7   1/1     Running   0          36s   192.168.1.178   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-szn8z   1/1     Running   0          36s   192.168.1.180   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-vdm7c   1/1     Running   0          36s   192.168.0.133   node-3    <none>           <none>
reclaimed-pod-5f7f69d7b8-vrr8w   1/1     Running   0          36s   192.168.1.179   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-w9hv4   1/1     Running   0          36s   192.168.2.109   node-2    <none>           <none>
reclaimed-pod-5f7f69d7b8-z2wqv   1/1     Running   0          36s   192.168.4.200   node-4    <none>           <none>

Custom

Besides these in-tree policies, you can also customize the scheduling strategy with self-defined scoring functions. In the example below, we use a self-defined RequestedToCapacityRatio scorer as the scheduling policy, and it behaves the same as the Binpack policy.
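A RequestedToCapacityRatio scorer is typically configured with a set of resources and a utilization-to-score shape. The snippet below is only an illustrative sketch modeled on the upstream kube-scheduler arguments; the field names and placement katalyst expects may differ, so treat ./examples/scheduler-policy-custom.yaml as the authoritative version.

resources:
  - name: katalyst.kubewharf.io/reclaimed_millicpu
    weight: 1
  - name: katalyst.kubewharf.io/reclaimed_memory
    weight: 1
shape:
  - utilization: 0
    score: 0
  - utilization: 100
    score: 10    # scoring higher at higher utilization gives binpack-like behavior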

$ kubectl apply -f ./examples/scheduler-policy-custom.yaml
$ kubectl get po -owide
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-547zk   1/1     Running   0          7s    192.168.1.191   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-6jzbs   1/1     Running   0          6s    192.168.1.193   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-6v7kr   1/1     Running   0          7s    192.168.2.111   node-2    <none>           <none>
reclaimed-pod-5f7f69d7b8-9vrb9   1/1     Running   0          6s    192.168.1.192   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-dnn7n   1/1     Running   0          7s    192.168.4.204   node-4   <none>           <none>
reclaimed-pod-5f7f69d7b8-jtgx9   1/1     Running   0          7s    192.168.1.189   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-kjrlv   1/1     Running   0          7s    192.168.0.139   node-3    <none>           <none>
reclaimed-pod-5f7f69d7b8-mr85t   1/1     Running   0          6s    192.168.1.194   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-q4dz5   1/1     Running   0          7s    192.168.1.188   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-v28nv   1/1     Running   0          7s    192.168.1.190   node-1    <none>           <none>

QoS Controlling

After pods are scheduled, katalyst enters its main QoS-controlling loop to dynamically adjust resource allocations on each node. In the current version, cpusets are used to isolate the scheduling domain for pods in each pool, and memory limits are used to cap memory usage.

Before going to the next step, remember to delete the previous pods so you start from a clean environment.

Apply a pod with shared_cores with the command below. After the ramp-up period, the cpuset for the shared pool will be 6 cores in total (i.e. 4 cores reserved for bursting plus 2 cores for regular requirements), and the remaining cores are considered available for reclaimed pods.

$ kubectl create -f ./examples/shared-normal-pod.yaml
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan  3 16:18:31 CST 2023
11,22-23,35,46-47
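If you prefer not to rely on the helper script, a rough equivalent is to read the container's cgroup files directly (hedged: the exact paths depend on whether the node runs cgroup v1 or v2):

$ kubectl exec shared-normal-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus             # cgroup v1
$ kubectl exec shared-normal-pod -- cat /sys/fs/cgroup/cpuset.cpus.effective          # cgroup v2
$ kubectl exec shared-normal-pod -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroup v1 memory limit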

Apply a pod with reclaimed_cores with the command below, and the cpuset for the reclaimed pool will be 42 cores (the 48 total minus the 6-core shared pool).

$ kubectl create -f ./examples/reclaimed-normal-pod.yaml
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan  3 16:23:20 CST 2023
0-10,12-21,24-34,36-45

Put pressure on the previous pod with shared_cores so that its load rises to 3; the cpuset for the shared pool will then be 8 cores in total (i.e. 4 cores reserved for bursting plus 4 cores for regular requirements), and the reclaimed pool will shrink to 40 cores accordingly.

$ kubectl exec shared-normal-pod -it -- stress -c 2
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan  3 16:25:23 CST 2023
10-11,22-23,34-35,46-47
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan  3 16:28:32 CST 2023
0-9,12-21,24-33,36-45
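To double-check the pool sizes, you can count the cores in a cpuset string with a small awk one-liner (plain shell, nothing katalyst-specific):

$ echo "0-9,12-21,24-33,36-45" | awk -F, '{n=0; for(i=1;i<=NF;i++){m=split($i,r,"-"); n += (m==1 ? 1 : r[2]-r[1]+1)} print n}'
40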

Pod Eviction

Eviction is a common fallback in case QoS requirements fail to be satisfied; we should always ensure that pods with higher priority (i.e. shared_cores) meet their QoS, evicting pods with lower priority (i.e. reclaimed_cores) if necessary. Katalyst provides both agent-side and centralized evictions to meet different requirements.

Before going to the next step, remember to delete the previous pods so you start from a clean environment.

Agent Eviction

Currently, katalyst provides several in-tree agent eviction implementations.

Resource OverCommit

Since reclaimed resources fluctuate with the running state of pods with shared_cores, they may shrink to a critical point where pods with reclaimed_cores can no longer run properly. In this case, katalyst evicts those pods so they can be rebalanced onto other nodes, using the following comparison: sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold

Apply several pods (with both shared_cores and reclaimed_cores), and put on enough pressure to reduce the allocatable reclaimed resources below the tolerance threshold. This will eventually trigger eviction of pod reclaimed-large-pod-2.

$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml
$ kubectl exec shared-large-pod-2 -it -- stress -c 40
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k
$ kubectl get event -A | grep evict
default     43s         Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
default     8s          Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin

The default threshold for reclaimed resources is 5; you can change it dynamically with KCC (KatalystCustomConfig).
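As a concrete illustration of the formula: with the allocatable reclaimed cpu shrunk to 4k millicpu as in the output above and the default threshold of 5, eviction triggers once the reclaimed_millicpu requests on the node sum to more than:

$ echo $(( 4000 * 5 ))   # 20000 millicpu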

$ kubectl create -f ./examples/kcc-eviction-reclaimed-resource-config.yaml

Memory

Memory eviction is implemented at two levels: NUMA-level eviction and system-level eviction. The former is used along with the numa-binding enhancement, while the latter covers more general cases; this tutorial mainly demonstrates the latter. At each level, katalyst triggers memory eviction based on memory usage and the kswapd active rate, to avoid hitting the kernel's slow path for memory allocation.
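To observe the signals this eviction reacts to directly on the node, you can check free memory and the kswapd scan/steal counters (standard Linux interfaces, independent of katalyst):

$ free -h
$ grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat   # rapidly rising counters indicate kswapd is active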

Apply several pods (including shared_cores and reclaimed_cores).

$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml

Apply a KCC to alter the default free-memory and kswapd rate thresholds.

$ kubectl create -f ./examples/kcc-eviction-memory-system-config.yaml

For Memory Usage

Exec into reclaimed-large-pod-2 and allocate enough memory. When free memory falls below the target threshold, eviction is triggered for pods with reclaimed_cores, choosing the pod that uses the most memory.

$ kubectl exec -it reclaimed-large-pod-2 -- bash
$ stress --vm 1 --vm-bytes 175G --vm-hang 1000 --verbose
$ kubectl get event -A | grep evict
default     2m40s       Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default     2m5s        Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"

For Kswapd

Log in to the worker node and put some pressure on system memory. When the kswapd active rate exceeds the target threshold (default = 1), eviction is triggered for pods with both reclaimed_cores and shared_cores, but pods with reclaimed_cores are evicted before those with shared_cores.

$ stress --vm 1 --vm-bytes 180G --vm-hang 1000 --verbose
$ kubectl get event -A | grep evict
default           2m2s        Normal    EvictCreated              pod/reclaimed-large-pod-2          Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default           92s         Normal    EvictSucceeded            pod/reclaimed-large-pod-2          Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"

Load

For pods with shared_cores, if any pod creates too many threads, the CFS scheduling period may be split into small slices, making throttling more frequent and hurting workload performance. To address this, katalyst implements load eviction: it tracks the load and triggers taint and eviction actions based on the following thresholds:
soft: load > resource_pool_cpu_amount
hard: load > resource_pool_cpu_amount * threshold
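As a purely hypothetical illustration of these bounds: if the shared pool spanned all 48 cores of the example node and the hard threshold were 2, the soft bound would be 48 and the hard bound 96, which is consistent with the steps below using stress -c 50 and stress -c 100 to cross them.

$ echo "soft: 48, hard: $(( 48 * 2 ))"   # hypothetical pool size and threshold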

Apply several pods (including shared_cores and reclaimed_cores).

$ kubectl create -f ./examples/shared-large-pod.yaml ./examples/reclaimed-large-pod.yaml

Put some pressure on the node until the load exceeds the soft bound. In this case, a taint is added to the CNR to prevent new pods from being scheduled, while existing pods keep running.

$ kubectl exec shared-large-pod-2 -it -- stress -c 50
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/CPUPressure
  timeAdded: "2023-01-05T05:26:51Z"

Put more pressure on the node until the load exceeds the hard bound. In this case, katalyst will evict the pods that create the largest number of threads.

$ kubectl exec shared-large-pod-2 -it -- stress -c 100
$ kubectl get event -A | grep evict
67s         Normal   EvictCreated     pod/shared-large-pod-2      Successfully create eviction; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
68s         Normal   Killing          pod/shared-large-pod-2      Stopping container stress
32s         Normal   EvictSucceeded   pod/shared-large-pod-2      Evicted pod has been deleted physically; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin

Centralized Eviction

In some cases, the agents may suffer from single-point problems: in a large cluster, a daemon may stop working due to various abnormal conditions, leaving the pods on that node out of control. To mitigate this, katalyst's centralized eviction will try to evict all reclaimed pods. By default, if the node's readiness state keeps failing for 10 minutes, katalyst taints the CNR as unschedulable so that no more pods with reclaimed_cores can be scheduled onto the node; if the readiness state keeps failing for 20 minutes, it tries to evict all pods with reclaimed_cores.

taints:
- effect: NoScheduleForReclaimedTasks
  key: node.kubernetes.io/unschedulable
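Since these actions key off node readiness, a simple way to spot nodes at risk is to watch for prolonged NotReady status (plain kubectl; the 10- and 20-minute windows themselves are enforced by katalyst):

$ kubectl get nodes | grep -w NotReady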

Furthermore

We will provide more tutorials along with future feature releases.