
Commit 4dc2291

Merge branch 'master' into e2eautoscaler-deflaky-dead-actor-resources
2 parents: 331d0a9 + 05b77e1


54 files changed: +7704 -6706 lines

.buildkite/test-e2e.yml

Lines changed: 24 additions & 4 deletions
````diff
--- a/.buildkite/test-e2e.yml
+++ b/.buildkite/test-e2e.yml
@@ -38,7 +38,7 @@
     - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 30m -v ./test/e2erayservice 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-rayservice-log.tar -T - && exit 1)
     - echo "--- END:e2e rayservice (nightly operator) tests finished"
 
-- label: 'Test Autoscaler E2E (nightly operator)'
+- label: 'Test Autoscaler E2E Part 1 (nightly operator)'
   instance_size: large
   image: golang:1.24
   commands:
@@ -50,13 +50,33 @@
     - bash ../.buildkite/build-start-operator.sh
     - kubectl wait --timeout=90s --for=condition=Available=true deployment kuberay-operator
     # Run e2e tests and print KubeRay operator logs if tests fail
-    - echo "--- START:Running Autoscaler e2e (nightly operator) tests"
+    - echo "--- START:Running Autoscaler E2E Part 1 (nightly operator) tests"
     - if [ -n "${KUBERAY_TEST_RAY_IMAGE}"]; then echo "Using Ray Image ${KUBERAY_TEST_RAY_IMAGE}"; fi
     - set -o pipefail
     - mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
     - echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
-    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 60m -v ./test/e2eautoscaler 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-autoscaler-log.tar -T - && exit 1)
-    - echo "--- END:Autoscaler e2e (nightly operator) tests finished"
+    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 60m -v ./test/e2eautoscaler/raycluster_autoscaler_test.go ./test/e2eautoscaler/support.go 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-autoscaler-log.tar -T - && exit 1)
+    - echo "--- END:Autoscaler E2E Part 1 (nightly operator) tests finished"
+
+- label: 'Test Autoscaler E2E Part 2 (nightly operator)'
+  instance_size: large
+  image: golang:1.24
+  commands:
+    - source .buildkite/setup-env.sh
+    - kind create cluster --wait 900s --config ./ci/kind-config-buildkite.yml
+    - kubectl config set clusters.kind-kind.server https://docker:6443
+    # Build nightly KubeRay operator image
+    - pushd ray-operator
+    - bash ../.buildkite/build-start-operator.sh
+    - kubectl wait --timeout=90s --for=condition=Available=true deployment kuberay-operator
+    # Run e2e tests and print KubeRay operator logs if tests fail
+    - echo "--- START:Running Autoscaler E2E Part 2 (nightly operator) tests"
+    - if [ -n "${KUBERAY_TEST_RAY_IMAGE}"]; then echo "Using Ray Image ${KUBERAY_TEST_RAY_IMAGE}"; fi
+    - set -o pipefail
+    - mkdir -p "$(pwd)/tmp" && export KUBERAY_TEST_OUTPUT_DIR=$(pwd)/tmp
+    - echo "KUBERAY_TEST_OUTPUT_DIR=$$KUBERAY_TEST_OUTPUT_DIR"
+    - KUBERAY_TEST_TIMEOUT_SHORT=1m KUBERAY_TEST_TIMEOUT_MEDIUM=5m KUBERAY_TEST_TIMEOUT_LONG=10m go test -timeout 60m -v ./test/e2eautoscaler/raycluster_autoscaler_part2_test.go ./test/e2eautoscaler/support.go 2>&1 | awk -f ../.buildkite/format.awk | tee $$KUBERAY_TEST_OUTPUT_DIR/gotest.log || (kubectl logs --tail -1 -l app.kubernetes.io/name=kuberay | tee $$KUBERAY_TEST_OUTPUT_DIR/kuberay-operator.log && cd $$KUBERAY_TEST_OUTPUT_DIR && find . -name "*.log" | tar -cf /artifact-mount/e2e-autoscaler-log.tar -T - && exit 1)
+    - echo "--- END:Autoscaler E2E Part 2 (nightly operator) tests finished"
 
 - label: 'Test E2E Operator Version Upgrade (v1.3.0)'
   instance_size: large
````
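The split relies on a property of `go test` worth noting: when it is given explicit file arguments rather than a package path, it compiles only the listed files, which is why `support.go` (the shared test helpers) is passed alongside each part's test file. A minimal sketch of reproducing the two invocations locally, assuming a kind cluster with the nightly operator already running as the CI steps above arrange:

```sh
# Sketch: run the two autoscaler e2e parts locally, mirroring the CI split.
# `go test` with file arguments compiles only those files, so support.go
# must accompany each part. Paths are relative to the ray-operator directory.
cd ray-operator

# Part 1
go test -timeout 60m -v \
  ./test/e2eautoscaler/raycluster_autoscaler_test.go \
  ./test/e2eautoscaler/support.go

# Part 2
go test -timeout 60m -v \
  ./test/e2eautoscaler/raycluster_autoscaler_part2_test.go \
  ./test/e2eautoscaler/support.go
```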

.github/workflows/image-release.yaml

Lines changed: 2 additions & 1 deletion
````diff
--- a/.github/workflows/image-release.yaml
+++ b/.github/workflows/image-release.yaml
@@ -213,7 +213,8 @@ jobs:
         run: echo "::set-output name=sha_short::$(git rev-parse --short HEAD)"
 
       - name: Set up Docker
-        uses: docker-practice/actions-setup-docker@master
+        uses: docker/setup-docker-action@v4
+
 
       - name: Log in to Quay.io
         uses: docker/login-action@v2
````

apiserver/Autoscaling.md

Lines changed: 103 additions & 112 deletions
````diff
--- a/apiserver/Autoscaling.md
+++ b/apiserver/Autoscaling.md
@@ -1,180 +1,171 @@
-# Creating Autoscaling clusters using API server
+# Creating Autoscaling clusters using APIServer
 
-One of the fundamental features of Ray is autoscaling. This [document] describes how to set up
-autoscaling using Ray operator. Here we will describe how to set it up using API server.
+One of Ray's key features is autoscaling. This [document] explains how to set up autoscaling
+with the Ray operator. Here, we demonstrate how to configure it using the APIServer and
+run an example.
 
-## Deploy KubeRay operator and API server
+## Setup
 
-Refer to [readme](README.md) for setting up KubRay operator and API server.
+Refer to the [README](README.md) for setting up the KubeRay operator and APIServer.
 
-```shell
-make operator-image cluster load-operator-image deploy-operator
+## Example
+
+This example walks through how to trigger scale-up and scale-down for RayCluster.
+
+Before proceeding with the example, remove any running RayClusters to ensure a successful
+execution of the steps below.
+
+```sh
+kubectl delete raycluster --all
 ```
 
-Alternatively, you could build and deploy the Operator and API server from local repo for
-development purpose.
+> [!IMPORTANT]
+> All the following guidance requires you to switch your working directory to the KubeRay `apiserver`
+
+### Install ConfigMap
 
-```shell
-make operator-image cluster load-operator-image deploy-operator docker-image load-image deploy
+Install this [ConfigMap], which contains the code for our example. Simply download
+the file and run:
+
+```sh
+kubectl apply -f test/cluster/cluster/detachedactor.yaml
 ```
 
-Additionally install this [ConfigMap] containing code that we will use for testing.
+Check if the ConfigMap is successfully created. You should see `ray-example` in the list:
+
+```sh
+kubectl get configmaps
+# NAME          DATA   AGE
+# ray-example   2      8s
+```
 
-## Deploy Ray cluster
+### Deploy RayCluster
 
-Once they are set up, you first need to create a Ray cluster using the following commands:
+Before running the example, deploy a RayCluster with the following command:
 
-```shell
+```sh
+# Create compute template
 curl -X POST 'localhost:31888/apis/v1/namespaces/default/compute_templates' \
   --header 'Content-Type: application/json' \
-  --data '{
-    "name": "default-template",
-    "namespace": "default",
-    "cpu": 2,
-    "memory": 4
-  }'
+  --data @docs/api-example/compute_template.json
+
+# Create RayCluster
 curl -X POST 'localhost:31888/apis/v1/namespaces/default/clusters' \
   --header 'Content-Type: application/json' \
-  --data '{
-    "name": "test-cluster",
-    "namespace": "default",
-    "user": "boris",
-    "clusterSpec": {
-      "enableInTreeAutoscaling": true,
-      "autoscalerOptions": {
-        "upscalingMode": "Default",
-        "idleTimeoutSeconds": 30,
-        "cpu": "500m",
-        "memory": "512Mi"
-      },
-      "headGroupSpec": {
-        "computeTemplate": "default-template",
-        "image": "rayproject/ray:2.9.0-py310",
-        "serviceType": "NodePort",
-        "rayStartParams": {
-          "dashboard-host": "0.0.0.0",
-          "metrics-export-port": "8080",
-          "num-cpus": "0"
-        },
-        "volumes": [
-          {
-            "name": "code-sample",
-            "mountPath": "/home/ray/samples",
-            "volumeType": "CONFIGMAP",
-            "source": "ray-example",
-            "items": {
-              "detached_actor.py": "detached_actor.py",
-              "terminate_detached_actor.py": "terminate_detached_actor.py"
-            }
-          }
-        ]
-      },
-      "workerGroupSpec": [
-        {
-          "groupName": "small-wg",
-          "computeTemplate": "default-template",
-          "image": "rayproject/ray:2.9.0-py310",
-          "replicas": 0,
-          "minReplicas": 0,
-          "maxReplicas": 5,
-          "rayStartParams": {
-            "node-ip-address": "$MY_POD_IP"
-          },
-          "volumes": [
-            {
-              "name": "code-sample",
-              "mountPath": "/home/ray/samples",
-              "volumeType": "CONFIGMAP",
-              "source": "ray-example",
-              "items": {
-                "detached_actor.py": "detached_actor.py",
-                "terminate_detached_actor.py": "terminate_detached_actor.py"
-              }
-            }
-          ]
-        }
-      ]
-    }
-  }'
+  --data @docs/api-example/autoscaling_clusters.json
 ```
 
-## Validate that Ray cluster is deployed correctly
+This command performs two main operations:
 
-Run:
+1. Creates a compute template `default-template` that specifies resources to use during
+   scale-up (2 CPUs and 4 GiB memory).
 
-```shell
-kubectl get pods
-```
+2. Deploys a RayCluster (test-cluster) with:
+   - A head pod that manages the cluster
+   - A worker group configured to scale between 0 and 5 replicas
+
+The worker group uses the following autoscalerOptions to control scaling behavior:
+
+- **`upscalingMode: "Default"`**: Default scaling behavior. Ray will scale up only as
+  needed.
+- **`idleTimeoutSeconds: 30`**: If a worker pod remains idle (i.e., not running any tasks)
+  for 30 seconds, it will be automatically removed.
+- **`cpu: "500m"`, `memory: "512Mi"`**: Defines the **minimum resource unit** Ray uses to
+  assess scaling needs. If no worker pod has at least this much free capacity, Ray will
+  trigger a scale-up and launch a new worker pod.
+
+> **Note:** These values **do not determine the actual size** of the worker pod. The
+> pod size comes from the `computeTemplate` (in this case, 2 CPUs and 4 GiB memory).
 
-You should get something like this:
+### Validate that RayCluster is deployed correctly
 
-```shell
-test-cluster-head-pr25j 2/2 Running 0 2m49s
+Run the following command to get a list of pods running. You should see something like below:
+
+```sh
+kubectl get pods
+# NAME                                READY   STATUS    RESTARTS   AGE
+# kuberay-operator-545586d46c-f9grr   1/1     Running   0          49m
+# test-cluster-head                   2/2     Running   0          3m1s
 ```
 
-Note that only head pod is running and it has 2 containers
+Note that there is no worker for `test-cluster` as we set its initial replicas to 0. You
+will only see head pod with 2 containers for `test-cluster`.
 
-## Trigger RayCluster scale-up
+### Trigger RayCluster scale-up
 
-Create a detached actor:
+Create a detached actor to trigger scale-up with the following command:
 
 ```sh
 curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \
   --header 'Content-Type: application/json' \
   --data '{
     "name": "create-actor",
     "namespace": "default",
-    "user": "boris",
+    "user": "kuberay",
     "entrypoint": "python /home/ray/samples/detached_actor.py actor1",
     "clusterSelector": {
       "ray.io/cluster": "test-cluster"
     }
   }'
 ```
 
-Because we have specified `num_cpu: 0` for head node, this will cause creation of a worker node. Run:
+The `detached_actor.py` file is defined in the [ConfigMap] we installed earlier and
+mounted to the head node, which requires `num_cpus=1`. Recall that initially there is no
+worker pod exists, RayCluster needs to scale up a worker for running this actor.
 
-```shell
-kubectl get pods
-```
+Check if a worker is created. You should see a worker `test-cluster-small-wg-worker` spin
+up.
 
-You should get something like this:
+```sh
+kubectl get pods
 
-```shell
-test-cluster-head-pr25j 2/2 Running 0 15m
-test-cluster-worker-small-wg-qrjfm 1/1 Running 0 2m48s
+# NAME                                 READY   STATUS      RESTARTS   AGE
+# create-actor-tsvfc                   0/1     Completed   0          99s
+# kuberay-operator-545586d46c-f9grr    1/1     Running     0          55m
+# test-cluster-head                    2/2     Running     0          9m37s
+# test-cluster-small-wg-worker-j54xf   1/1     Running     0          88s
 ```
 
-You can see that a worker node have been created.
+### Trigger RayCluster scale-down
 
-## Trigger RayCluster scale-down
-
-Run:
+Run the following command to delete the actor we created earlier:
 
 ```sh
 curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \
   --header 'Content-Type: application/json' \
   --data '{
     "name": "delete-actor",
     "namespace": "default",
-    "user": "boris",
+    "user": "kuberay",
     "entrypoint": "python /home/ray/samples/terminate_detached_actor.py actor1",
     "clusterSelector": {
       "ray.io/cluster": "test-cluster"
     }
   }'
 ```
 
-A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s, we specified 30) seconds. Run:
+Once the actor is deleted, the worker is no longer needed. The worker pod will be deleted
+after `idleTimeoutSeconds` (default 60; we specified 30) seconds.
+
+List all pods to verify that the worker pod is deleted:
 
-```shell
+```sh
 kubectl get pods
+
+# NAME                                READY   STATUS      RESTARTS   AGE
+# create-actor-tsvfc                  0/1     Completed   0          6m37s
+# delete-actor-89z8c                  0/1     Completed   0          83s
+# kuberay-operator-545586d46c-f9grr   1/1     Running     0          60m
+# test-cluster-head                   2/2     Running     0          14m
+
 ```
 
-And you should see only head node (worker node is deleted)
+### Clean up
 
-```shell
-test-cluster-head-pr25j 2/2 Running 0 27m
+```sh
+make clean-cluster
+# Remove apiserver from helm
+helm uninstall kuberay-apiserver
 ```
 
 [document]: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/configuring-autoscaling.html
````
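The diff above moves the inline request payloads into files under `docs/api-example/`. Assuming those files simply mirror the inline JSON deleted in this commit (not verified against the repository), the file-based requests are equivalent to the following sketch; values are taken from the removed lines, and the cluster payload is abridged:

```sh
# Sketch: inline equivalents of the new file-based requests, assuming
# compute_template.json and autoscaling_clusters.json mirror the payloads
# removed in this diff.

# Compute template: the actual pod size used when the autoscaler scales up.
curl -X POST 'localhost:31888/apis/v1/namespaces/default/compute_templates' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "default-template",
    "namespace": "default",
    "cpu": 2,
    "memory": 4
  }'

# RayCluster: autoscaling enabled, head contributes no CPUs ("num-cpus": "0"),
# worker group scales between 0 and 5 replicas. Abridged: the volumes mounting
# the ray-example ConfigMap are omitted here but present in the removed payload.
curl -X POST 'localhost:31888/apis/v1/namespaces/default/clusters' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "test-cluster",
    "namespace": "default",
    "user": "boris",
    "clusterSpec": {
      "enableInTreeAutoscaling": true,
      "autoscalerOptions": {
        "upscalingMode": "Default",
        "idleTimeoutSeconds": 30,
        "cpu": "500m",
        "memory": "512Mi"
      },
      "headGroupSpec": {
        "computeTemplate": "default-template",
        "image": "rayproject/ray:2.9.0-py310",
        "serviceType": "NodePort",
        "rayStartParams": {
          "dashboard-host": "0.0.0.0",
          "metrics-export-port": "8080",
          "num-cpus": "0"
        }
      },
      "workerGroupSpec": [
        {
          "groupName": "small-wg",
          "computeTemplate": "default-template",
          "image": "rayproject/ray:2.9.0-py310",
          "replicas": 0,
          "minReplicas": 0,
          "maxReplicas": 5,
          "rayStartParams": {
            "node-ip-address": "$MY_POD_IP"
          }
        }
      ]
    }
  }'
```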
