# MLBatch Setup
The MLBatch setup consists of a [cluster setup](#cluster-setup) to be done once
and a [team setup](#team-setup) to be repeated for each team that will
be using the cluster. This document also discusses [quota maintenance](#quota-maintenance).
Batch users should only be permitted to create AppWrappers or workloads whose
types are natively supported by Kueue. The cluster setup defines a
`mlbatch-edit` role which enforces these restrictions and will be used in
the setup process for each team of MLBatch users that is onboarded.
This setup has been developed on OpenShift 4.14 and Kubernetes 1.27 and
is intended to support OpenShift 4.12 and up and/or Kubernetes 1.25 and up.
To start with, recursively clone and enter this repository:
```sh
cd mlbatch
```

## Cluster Setup
Step by step setup instructions are provided for the following versions:
+ [OpenShift AI 2.10](./setup.RHOAI-v2.10/CLUSTER-SETUP.md)
+ [OpenShift AI 2.11](./setup.RHOAI-v2.11/CLUSTER-SETUP.md)
+ [Kubernetes 1.25+](./setup.k8s-v1.25/CLUSTER-SETUP.md)
## Team Setup
To onboard a team to the cluster, a cluster admin will create and configure
an OpenShift project (or Kubernetes namespace) for the team.
Step by step setup instructions are provided for the following versions:
+ [OpenShift AI 2.10](./setup.RHOAI-v2.10/TEAM-SETUP.md)
+ [OpenShift AI 2.11](./setup.RHOAI-v2.11/TEAM-SETUP.md)
+ [Kubernetes 1.25+](./setup.k8s-v1.25/TEAM-SETUP.md)
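The version-specific guides cover the full procedure. As an illustrative sketch, the core of a team setup is a Kueue `ClusterQueue` holding the team's quota and a `LocalQueue` binding that quota to the team's namespace; `team1`, the queue names, and all quantities below are placeholders to adjust:

```yaml
# Sketch only; team1 and all quota values are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
  namespace: team1
spec:
  clusterQueue: team1-cluster-queue
```

We recommend naming the local queue `default-queue`, as `AppWrappers` will default to this queue name.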
## Quota Maintenance
## Cleanup
First, remove all team projects/namespaces and corresponding cluster queues.
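For example, for a team onboarded as `team1` with a `team1-cluster-queue` (a sketch; substitute your actual project and queue names, and repeat for each team):

```sh
# Sketch only; requires a live cluster and cluster-admin privileges.
oc delete localqueue --all -n team1
oc delete clusterqueue team1-cluster-queue
oc delete project team1
```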
Second, follow the version-specific instructions to uninstall the MLBatch controllers
and reclaim the corresponding namespaces.
+ [OpenShift AI 2.10](./setup.RHOAI-v2.10/UNINSTALL.md)
+ [OpenShift AI 2.11](./setup.RHOAI-v2.11/UNINSTALL.md)
+ [Kubernetes 1.25+](./setup.k8s-v1.25/UNINSTALL.md)