|
1 | | -# Creating Autoscaling clusters using API server |
| 1 | +# Creating Autoscaling Clusters Using the APIServer
2 | 2 |
|
3 | | -One of the fundamental features of Ray is autoscaling. This [document] describes how to set up |
4 | | -autoscaling using Ray operator. Here we will describe how to set it up using API server. |
| 3 | +One of Ray's key features is autoscaling. This [document] explains how to set up autoscaling |
| 4 | +with the Ray operator. Here, we demonstrate how to configure it using the APIServer and |
| 5 | +run an example. |
5 | 6 |
|
6 | | -## Deploy KubeRay operator and API server |
| 7 | +## Setup |
7 | 8 |
|
8 | | -Refer to [readme](README.md) for setting up KubRay operator and API server. |
| 9 | +Refer to the [README](README.md) for setting up the KubeRay operator and APIServer. |
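| | +
| | +If you install both components with the KubeRay Helm charts, the setup looks roughly like the
| | +sketch below; the README is the authoritative source, and chart values (for example the
| | +APIServer's NodePort) may differ in your environment:
| | +
| | +```sh
| | +helm repo add kuberay https://ray-project.github.io/kuberay-helm/
| | +helm repo update
| | +# Install the KubeRay operator and the APIServer into the current namespace
| | +helm install kuberay-operator kuberay/kuberay-operator
| | +helm install kuberay-apiserver kuberay/kuberay-apiserver
| | +```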
9 | 10 |
|
10 | | -```shell |
11 | | -make operator-image cluster load-operator-image deploy-operator |
| 11 | +## Example |
| 12 | + |
| 13 | +This example walks through how to trigger scale-up and scale-down for a RayCluster.
| 14 | +
| 15 | +Before proceeding, delete any existing RayClusters so that the steps below run
| 16 | +cleanly:
| 17 | + |
| 18 | +```sh |
| 19 | +kubectl delete raycluster --all |
12 | 20 | ``` |
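| | +
| | +You can confirm that nothing is left over before continuing:
| | +
| | +```sh
| | +kubectl get rayclusters
| | +# No resources found in default namespace.
| | +```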
13 | 21 |
|
14 | | -Alternatively, you could build and deploy the Operator and API server from local repo for |
15 | | -development purpose. |
| 22 | +> [!IMPORTANT] |
| 23 | +> All of the following steps assume that your working directory is the KubeRay `apiserver` directory.
| 24 | +
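| | +For example, if you cloned KubeRay into your home directory (the clone path is just an
| | +illustration):
| | +
| | +```sh
| | +cd ~/kuberay/apiserver
| | +```
| | +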
|
| 25 | +### Install ConfigMap |
16 | 26 |
|
17 | | -```shell |
18 | | -make operator-image cluster load-operator-image deploy-operator docker-image load-image deploy |
| 27 | +Install this [ConfigMap], which contains the Python scripts for the example, by applying
| 28 | +the manifest from the repository:
| 29 | + |
| 30 | +```sh |
| 31 | +kubectl apply -f test/cluster/cluster/detachedactor.yaml |
19 | 32 | ``` |
20 | 33 |
|
21 | | -Additionally install this [ConfigMap] containing code that we will use for testing. |
| 34 | +Check that the ConfigMap was created successfully. You should see `ray-example` in the list:
| 35 | + |
| 36 | +```sh |
| 37 | +kubectl get configmaps |
| 38 | +# NAME DATA AGE |
| 39 | +# ray-example 2 8s |
| 40 | +``` |
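| | +
| | +To see the two Python scripts (`detached_actor.py` and `terminate_detached_actor.py`) that the
| | +example mounts into the Ray pods, inspect the ConfigMap directly:
| | +
| | +```sh
| | +kubectl describe configmap ray-example
| | +# or dump the full contents:
| | +kubectl get configmap ray-example -o yaml
| | +```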
22 | 41 |
|
23 | | -## Deploy Ray cluster |
| 42 | +### Deploy RayCluster |
24 | 43 |
|
25 | | -Once they are set up, you first need to create a Ray cluster using the following commands: |
| 44 | +Before running the example, deploy a RayCluster with the following commands:
26 | 45 |
|
27 | | -```shell |
| 46 | +```sh |
| 47 | +# Create compute template |
28 | 48 | curl -X POST 'localhost:31888/apis/v1/namespaces/default/compute_templates' \ |
29 | 49 | --header 'Content-Type: application/json' \ |
30 | | ---data '{ |
31 | | - "name": "default-template", |
32 | | - "namespace": "default", |
33 | | - "cpu": 2, |
34 | | - "memory": 4 |
35 | | -}' |
| 50 | +--data @docs/api-example/compute_template.json |
| 51 | + |
| 52 | +# Create RayCluster |
36 | 53 | curl -X POST 'localhost:31888/apis/v1/namespaces/default/clusters' \ |
37 | 54 | --header 'Content-Type: application/json' \ |
38 | | ---data '{ |
39 | | - "name": "test-cluster", |
40 | | - "namespace": "default", |
41 | | - "user": "boris", |
42 | | - "clusterSpec": { |
43 | | - "enableInTreeAutoscaling": true, |
44 | | - "autoscalerOptions": { |
45 | | - "upscalingMode": "Default", |
46 | | - "idleTimeoutSeconds": 30, |
47 | | - "cpu": "500m", |
48 | | - "memory": "512Mi" |
49 | | - }, |
50 | | - "headGroupSpec": { |
51 | | - "computeTemplate": "default-template", |
52 | | - "image": "rayproject/ray:2.9.0-py310", |
53 | | - "serviceType": "NodePort", |
54 | | - "rayStartParams": { |
55 | | - "dashboard-host": "0.0.0.0", |
56 | | - "metrics-export-port": "8080", |
57 | | - "num-cpus": "0" |
58 | | - }, |
59 | | - "volumes": [ |
60 | | - { |
61 | | - "name": "code-sample", |
62 | | - "mountPath": "/home/ray/samples", |
63 | | - "volumeType": "CONFIGMAP", |
64 | | - "source": "ray-example", |
65 | | - "items": { |
66 | | - "detached_actor.py": "detached_actor.py", |
67 | | - "terminate_detached_actor.py": "terminate_detached_actor.py" |
68 | | - } |
69 | | - } |
70 | | - ] |
71 | | - }, |
72 | | - "workerGroupSpec": [ |
73 | | - { |
74 | | - "groupName": "small-wg", |
75 | | - "computeTemplate": "default-template", |
76 | | - "image": "rayproject/ray:2.9.0-py310", |
77 | | - "replicas": 0, |
78 | | - "minReplicas": 0, |
79 | | - "maxReplicas": 5, |
80 | | - "rayStartParams": { |
81 | | - "node-ip-address": "$MY_POD_IP" |
82 | | - }, |
83 | | - "volumes": [ |
84 | | - { |
85 | | - "name": "code-sample", |
86 | | - "mountPath": "/home/ray/samples", |
87 | | - "volumeType": "CONFIGMAP", |
88 | | - "source": "ray-example", |
89 | | - "items": { |
90 | | - "detached_actor.py": "detached_actor.py", |
91 | | - "terminate_detached_actor.py": "terminate_detached_actor.py" |
92 | | - } |
93 | | - } |
94 | | - ] |
95 | | - } |
96 | | - ] |
97 | | - } |
98 | | -}' |
| 55 | +--data @docs/api-example/autoscaling_clusters.json |
99 | 56 | ``` |
100 | 57 |
|
101 | | -## Validate that Ray cluster is deployed correctly |
| 58 | +These commands perform two main operations:
102 | 59 |
|
103 | | -Run: |
| 60 | +1. Creates a compute template `default-template` that defines the pod size used for the head
| 61 | +   and worker pods (2 CPUs and 4 GiB of memory each).
104 | 62 |
|
105 | | -```shell |
106 | | -kubectl get pods |
107 | | -``` |
| 63 | +2. Deploys a RayCluster (`test-cluster`) with:
| 64 | +   - A head pod that manages the cluster
| 65 | +   - A worker group (`small-wg`) configured to scale between 0 and 5 replicas
| 66 | + |
| 67 | +The cluster's `autoscalerOptions` control how the worker group scales:
| 68 | +
| 69 | +- **`upscalingMode: "Default"`**: Default scaling behavior; Ray scales up only as
| 70 | +needed.
| 71 | +- **`idleTimeoutSeconds: 30`**: If a worker pod remains idle (i.e., not running any tasks
| 72 | +or actors) for 30 seconds, the autoscaler removes it.
| 73 | +- **`cpu: "500m"`, `memory: "512Mi"`**: The CPU and memory requests for the autoscaler
| 74 | +sidecar container that runs next to the Ray head container and makes the scaling decisions.
| 75 | +
| 76 | +> **Note:** These values **do not determine the size** of the worker pods. Worker pod
| 77 | +> size comes from the `computeTemplate` (in this case, 2 CPUs and 4 GiB of memory).
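| | +
| | +You can also read the cluster back through the APIServer to double-check the applied spec;
| | +this assumes the standard GET endpoint for a single cluster:
| | +
| | +```sh
| | +curl 'localhost:31888/apis/v1/namespaces/default/clusters/test-cluster'
| | +```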
108 | 79 |
|
109 | | -You should get something like this: |
| 80 | +### Validate that RayCluster is deployed correctly |
110 | 81 |
|
111 | | -```shell |
112 | | -test-cluster-head-pr25j 2/2 Running 0 2m49s |
| 82 | +Run the following command to list the running pods. You should see output similar to the following:
| 83 | + |
| 84 | +```sh |
| 85 | +kubectl get pods |
| 86 | +# NAME READY STATUS RESTARTS AGE |
| 87 | +# kuberay-operator-545586d46c-f9grr 1/1 Running 0 49m |
| 88 | +# test-cluster-head 2/2 Running 0 3m1s |
113 | 89 | ``` |
114 | 90 |
|
115 | | -Note that only head pod is running and it has 2 containers |
| 91 | +Note that there are no workers for `test-cluster` because the worker group's initial `replicas`
| 92 | +is 0. You will only see the head pod, which runs 2 containers: the Ray head and the autoscaler sidecar.
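| | +
| | +You can also ask the autoscaler for its view of the cluster from inside the head pod (the head
| | +container is named `ray-head` in KubeRay-managed pods):
| | +
| | +```sh
| | +kubectl exec test-cluster-head -c ray-head -- ray status
| | +# Should report a single head node, no worker nodes, and no pending resource demands yet.
| | +```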
116 | 93 |
|
117 | | -## Trigger RayCluster scale-up |
| 94 | +### Trigger RayCluster scale-up |
118 | 95 |
|
119 | | -Create a detached actor: |
| 96 | +Create a detached actor to trigger scale-up with the following command: |
120 | 97 |
|
121 | 98 | ```sh |
122 | 99 | curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \ |
123 | 100 | --header 'Content-Type: application/json' \ |
124 | 101 | --data '{ |
125 | 102 | "name": "create-actor", |
126 | 103 | "namespace": "default", |
127 | | - "user": "boris", |
| 104 | + "user": "kuberay", |
128 | 105 | "entrypoint": "python /home/ray/samples/detached_actor.py actor1", |
129 | 106 | "clusterSelector": { |
130 | 107 | "ray.io/cluster": "test-cluster" |
131 | 108 | } |
132 | 109 | }' |
133 | 110 | ``` |
134 | 111 |
|
135 | | -Because we have specified `num_cpu: 0` for head node, this will cause creation of a worker node. Run: |
| 112 | +The `detached_actor.py` script comes from the [ConfigMap] we installed earlier and is mounted
| 113 | +into the Ray pods at `/home/ray/samples`; the actor it creates requires `num_cpus=1`. Because the
| 114 | +head node starts with `num-cpus: 0` and there are initially no worker pods, the RayCluster must scale up a worker to run this actor.
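| | +
| | +While the new pod is starting, you can follow the scaling decision in the autoscaler sidecar's
| | +logs (the sidecar container is named `autoscaler`):
| | +
| | +```sh
| | +kubectl logs test-cluster-head -c autoscaler --tail=20
| | +```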
136 | 115 |
|
137 | | -```shell |
138 | | -kubectl get pods |
139 | | -``` |
| 116 | +Check whether a worker has been created. You should see a pod whose name starts with
| 117 | +`test-cluster-small-wg-worker` spin up.
140 | 118 |
|
141 | | -You should get something like this: |
| 119 | +```sh |
| 120 | +kubectl get pods |
142 | 121 |
|
143 | | -```shell |
144 | | -test-cluster-head-pr25j 2/2 Running 0 15m |
145 | | -test-cluster-worker-small-wg-qrjfm 1/1 Running 0 2m48s |
| 122 | +# NAME READY STATUS RESTARTS AGE |
| 123 | +# create-actor-tsvfc 0/1 Completed 0 99s |
| 124 | +# kuberay-operator-545586d46c-f9grr 1/1 Running 0 55m |
| 125 | +# test-cluster-head 2/2 Running 0 9m37s |
| 126 | +# test-cluster-small-wg-worker-j54xf 1/1 Running 0 88s |
146 | 127 | ``` |
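| | +
| | +The `create-actor` submission is backed by a RayJob custom resource, so you can also inspect
| | +its status with kubectl:
| | +
| | +```sh
| | +kubectl get rayjobs
| | +```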
147 | 128 |
|
148 | | -You can see that a worker node have been created. |
| 129 | +### Trigger RayCluster scale-down |
149 | 130 |
|
150 | | -## Trigger RayCluster scale-down |
151 | | - |
152 | | -Run: |
| 131 | +Run the following command to submit a job that terminates the actor we created earlier:
153 | 132 |
|
154 | 133 | ```sh |
155 | 134 | curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \ |
156 | 135 | --header 'Content-Type: application/json' \ |
157 | 136 | --data '{ |
158 | 137 | "name": "delete-actor", |
159 | 138 | "namespace": "default", |
160 | | - "user": "boris", |
| 139 | + "user": "kuberay", |
161 | 140 | "entrypoint": "python /home/ray/samples/terminate_detached_actor.py actor1", |
162 | 141 | "clusterSelector": { |
163 | 142 | "ray.io/cluster": "test-cluster" |
164 | 143 | } |
165 | 144 | }' |
166 | 145 | ``` |
167 | 146 |
|
168 | | -A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s, we specified 30) seconds. Run: |
| 147 | +Once the actor is terminated, the worker is no longer needed. The idle worker pod is removed
| 148 | +after `idleTimeoutSeconds` (60 by default; we set it to 30).
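| | +
| | +If you want to watch the scale-down happen live, keep a watch on the pods; the worker should be
| | +terminated roughly 30 seconds after the actor is gone:
| | +
| | +```sh
| | +kubectl get pods -w
| | +```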
| 149 | + |
| 150 | +List all pods to verify that the worker pod is deleted: |
169 | 151 |
|
170 | | -```shell |
| 152 | +```sh |
171 | 153 | kubectl get pods |
| 154 | + |
| 155 | +# NAME READY STATUS RESTARTS AGE |
| 156 | +# create-actor-tsvfc 0/1 Completed 0 6m37s |
| 157 | +# delete-actor-89z8c 0/1 Completed 0 83s |
| 158 | +# kuberay-operator-545586d46c-f9grr 1/1 Running 0 60m |
| 159 | +# test-cluster-head 2/2 Running 0 14m |
| 160 | + |
172 | 161 | ``` |
173 | 162 |
|
174 | | -And you should see only head node (worker node is deleted) |
| 163 | +### Clean up |
175 | 164 |
|
176 | | -```shell |
177 | | -test-cluster-head-pr25j 2/2 Running 0 27m |
| 165 | +```sh |
| 166 | +make clean-cluster |
| 167 | +# Remove apiserver from helm |
| 168 | +helm uninstall kuberay-apiserver |
178 | 169 | ``` |
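| | +
| | +If the operator was also installed as a Helm release named `kuberay-operator` (as in the setup
| | +sketch above), you can remove it and verify that nothing is left behind:
| | +
| | +```sh
| | +helm uninstall kuberay-operator
| | +kubectl get pods
| | +```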
179 | 170 |
|
180 | 171 | [document]: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html |
|