Commit 3f5d22d

document setup of slack cluster queue
1 parent a26ecdd commit 3f5d22d

9 files changed: +139 -3 lines changed

setup.k8s-v1.25/CLUSTER-SETUP.md

+43
@@ -98,3 +98,46 @@ Create `mlbatch-edit` role:
 ```sh
 kubectl apply -f setup.k8s-v1.25/mlbatch-edit-role.yaml
 ```
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue` which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance.
+```sh
+kubectl apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources. The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.
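
The note about editing quantities and omitting pod counts can be illustrated with a small sketch. The `render_slack_cq` function below is a hypothetical helper, not part of MLBatch: it prints the same manifest with caller-supplied CPU, memory, and GPU quotas and leaves `pods` out of the covered resources. The argument order and defaults are assumptions made for illustration only.

```sh
#!/bin/sh
# Hypothetical helper (not part of MLBatch): render the slack ClusterQueue
# manifest with caller-supplied quotas and without the optional "pods" resource.
render_slack_cq() {
  cpu="${1:-8000m}"      # CPU quota; defaults to the value used in the setup doc
  memory="${2:-128Gi}"   # memory quota
  gpu="${3:-8}"          # GPU quota
  cat << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: ${cpu}
      - name: "memory"
        nominalQuota: ${memory}
      - name: "nvidia.com/gpu"
        nominalQuota: ${gpu}
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
EOF
}

# Print the manifest for review before applying it.
render_slack_cq 16000m 256Gi 16
```

Once the printed manifest looks right, the same output could be piped to `kubectl apply -f-`, as in the documented command.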

setup.k8s-v1.25/appwrapper/config_patch.yaml

+1
@@ -13,6 +13,7 @@ data:
 enable: false
 defaultQueueName: default-queue
 schedulerName: scheduler-plugins-scheduler
+slackQueueName: slack-cluster-queue
 userRBACAdmissionCheck: false
 controllerManager:
 health:

setup.k8s-v1.30/CLUSTER-SETUP.md

+43
@@ -104,3 +104,46 @@ will have local queue names and thus be subject to Kueue's quota management.
 ```sh
 kubectl apply -f setup.k8s-v1.30/admission-policy.yaml
 ```
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue` which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance.
+```sh
+kubectl apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources. The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.

setup.k8s-v1.30/appwrapper/config_patch.yaml

+1
@@ -13,6 +13,7 @@ data:
 enable: false
 defaultQueueName: default-queue
 schedulerName: scheduler-plugins-scheduler
+slackQueueName: slack-cluster-queue
 userRBACAdmissionCheck: false
 controllerManager:
 health:

setup.tmpl/CLUSTER-SETUP.md.tmpl

+46
@@ -196,3 +196,49 @@ will have local queue names and thus be subject to Kueue's quota management.
 {{ .KUBECTL }} apply -f setup.{{ .VERSION }}/admission-policy.yaml
 ```
 {{- end }}
+
+{{- if .SLACKCQ }}
+
+## Slack Cluster Queue
+
+Create the designated slack `ClusterQueue` which will be used to automate
+minor adjustments to cluster capacity caused by node failures and
+scheduler maintenance.
+```sh
+{{ .KUBECTL }} apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: slack-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+      - name: "memory"
+        nominalQuota: 128Gi
+      - name: "nvidia.com/gpu"
+        nominalQuota: 8
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 1
+      - name: "pods"
+        nominalQuota: 100
+EOF
+```
+Edit the above quantities to adjust the quota to the desired
+values. Pod counts are optional and can be omitted from the list of
+covered resources. The `lendingLimit` for each resource will be
+dynamically adjusted by the MLBatch system to reflect reduced cluster
+capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
+detailed discussion of the role of the slack `ClusterQueue`.
+{{- end }}

setup.tmpl/Kubernetes-v1.25.yaml

+1
@@ -4,3 +4,4 @@ OPENSHIFT: false
 VERSION: k8s-v1.25
 KUBECTL: kubectl
 VAP: false
+SLACKCQ: true

setup.tmpl/Kubernetes-v1.30.yaml

+1
@@ -4,3 +4,4 @@ OPENSHIFT: false
 VERSION: k8s-v1.30
 KUBECTL: kubectl
 VAP: true
+SLACKCQ: true

setup.tmpl/RHOAI-v2.10.yaml

+2-1
@@ -2,4 +2,5 @@
 
 OPENSHIFT: true
 VERSION: RHOAI-v2.10
-KUBECTL: oc
+KUBECTL: oc
+SLACKCQ: false

setup.tmpl/RHOAI-v2.11.yaml

+1-2
@@ -3,5 +3,4 @@
 OPENSHIFT: true
 VERSION: RHOAI-v2.11
 KUBECTL: oc
-
-
+SLACKCQ: false
