first cut at k8s 1.30 VAP instructions
1 parent e6635f0, commit 03e737c
Showing 30 changed files with 705 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
# Cluster Setup | ||
|
||
The cluster setup installs and configures the following components: | ||
+ Coscheduler | ||
+ Kubeflow Training Operator | ||
+ KubeRay | ||
+ Kueue | ||
+ AppWrappers | ||
+ Cluster roles and priority classes | ||
|
||
If MLBatch is deployed on a cluster that used to run earlier versions of ODH, | ||
[MCAD](https://github.com/project-codeflare/mcad), or Coscheduler, | ||
scrub all traces of these installations. In particular, delete the following | ||
custom resource definitions (CRDs) if present on the cluster, deleting all | ||
instances of each CRD before deleting the CRD itself: | ||
```sh | ||
# Delete old appwrappers and crd | ||
kubectl delete appwrappers --all -A | ||
kubectl delete crd appwrappers.workload.codeflare.dev | ||
|
||
# Delete old noderesourcetopologies and crd | ||
kubectl delete noderesourcetopologies --all -A | ||
kubectl delete crd noderesourcetopologies.topology.node.k8s.io | ||
``` | ||
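The cleanup above can be made conditional, so that it is safe to run on a cluster where one of the CRDs was never installed. The following is a minimal sketch and not part of the MLBatch repository; the helper name `delete_crd_if_present` and the `KUBECTL` override variable are hypothetical.

```sh
# Sketch only: helper name and KUBECTL variable are illustrative assumptions.
# KUBECTL may be overridden, e.g. KUBECTL="echo kubectl" to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# delete_crd_if_present RESOURCE CRD
# Deletes every instance of RESOURCE in all namespaces, then the CRD itself,
# but only when the CRD exists on the cluster.
delete_crd_if_present() {
    resource="$1"
    crd="$2"
    if $KUBECTL get crd "$crd" >/dev/null 2>&1; then
        $KUBECTL delete "$resource" --all -A
        $KUBECTL delete crd "$crd"
    else
        echo "CRD $crd not found, skipping"
    fi
}
```

For example, call `delete_crd_if_present appwrappers appwrappers.workload.codeflare.dev` and then `delete_crd_if_present noderesourcetopologies noderesourcetopologies.topology.node.k8s.io`.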
|
||
## Priorities | ||
|
||
Create `default-priority`, `high-priority`, and `low-priority` priority classes: | ||
```sh | ||
kubectl apply -f setup.k8s-v1.30/mlbatch-priorities.yaml | ||
``` | ||
|
||
## Coscheduler | ||
|
||
Install Coscheduler v0.28.9 as a secondary scheduler and configure packing: | ||
```sh | ||
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \ | ||
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \ | ||
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]' | ||
``` | ||
Patch Coscheduler pod priorities: | ||
```sh | ||
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-controller | ||
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-scheduler | ||
``` | ||
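The `pluginConfig` JSON passed to `helm install` in this section is hard to read on one line. It configures the `NodeResourcesFit` plugin with a `RequestedToCapacityRatio` scoring strategy on `nvidia.com/gpu`, where the score grows from 0 at 0% utilization to 10 at 100%, so nodes whose GPUs are already in use are preferred. The sketch below keeps a readable copy of the same JSON and compacts it; `plugin_config_json` is a hypothetical helper name.

```sh
# Sketch only: plugin_config_json is a hypothetical helper name.
# Prints the pluginConfig value as compact JSON. Stripping all whitespace is
# safe here only because no string in this JSON document contains a space.
plugin_config_json() {
    tr -d ' \t\n' <<'EOF'
[
  {
    "args": {
      "scoringStrategy": {
        "resources": [
          {"name": "nvidia.com/gpu", "weight": 1}
        ],
        "requestedToCapacityRatio": {
          "shape": [
            {"utilization": 0, "score": 0},
            {"utilization": 100, "score": 10}
          ]
        },
        "type": "RequestedToCapacityRatio"
      }
    },
    "name": "NodeResourcesFit"
  }
]
EOF
}
```

The output can then be passed to the install command as `--set-json pluginConfig="$(plugin_config_json)"`.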
|
||
## Install Operators | ||
|
||
Create the `mlbatch-system` namespace: | ||
```sh | ||
kubectl create namespace mlbatch-system | ||
``` | ||
|
||
Install the Kubeflow Training Operator: | ||
```sh | ||
kubectl apply --server-side -k setup.k8s-v1.30/training-operator | ||
``` | ||
|
||
Install the KubeRay Operator: | ||
```sh | ||
kubectl apply --server-side -k setup.k8s-v1.30/kuberay | ||
``` | ||
|
||
Install Kueue: | ||
```sh | ||
kubectl apply --server-side -k setup.k8s-v1.30/kueue | ||
``` | ||
|
||
Install the AppWrapper Operator: | ||
```sh | ||
kubectl apply --server-side -k setup.k8s-v1.30/appwrapper | ||
``` | ||
The provided configuration differs from the default configuration of the | ||
operators as follows: | ||
- Kubeflow Training Operator: | ||
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`, | ||
- Kueue: | ||
  - `waitForPodsReady` is disabled, | ||
- AppWrapper operator: | ||
  - `userRBACAdmissionCheck` is disabled, | ||
  - `schedulerName` is set to `scheduler-plugins-scheduler`, | ||
  - `defaultQueueName` is set to `default-queue`, | ||
- pod priorities, resource requests and limits have been adjusted. | ||
|
||
## Kueue Configuration | ||
|
||
Create Kueue's default flavor: | ||
```sh | ||
kubectl apply -f setup.k8s-v1.30/default-flavor.yaml | ||
``` | ||
|
||
## Cluster Role | ||
|
||
Create `mlbatch-edit` role: | ||
```sh | ||
kubectl apply -f setup.k8s-v1.30/mlbatch-edit-role.yaml | ||
``` | ||
## Validating Admission Policy | ||
|
||
Create a validating admission policy that works with the mlbatch-edit role to | ||
ensure that all pod-creating resources created in team namespaces will be properly | ||
tracked for quota usage. | ||
```sh | ||
kubectl apply -f setup.k8s-v1.30/admission-policy.yaml | ||
``` |
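According to `admission-policy.yaml`, the policy rejects `CREATE` and `UPDATE` requests for Jobs, PyTorchJobs, RayJobs, and RayClusters that lack a non-empty `kueue.x-k8s.io/queue-name` label. A rough client-side pre-check can catch missing labels before a manifest is submitted. This is a sketch under stated assumptions: `has_queue_name_label` is a hypothetical helper, and it performs a plain text match rather than evaluating the policy's CEL expression.

```sh
# Sketch only: has_queue_name_label is a hypothetical helper. It is a rough
# text match on a YAML manifest read from stdin, not an evaluation of the
# policy's CEL expression, and it does not check where the key appears.
# Exit status 0 means a non-empty queue-name value was found.
has_queue_name_label() {
    grep -Eq "^[[:space:]]*kueue\.x-k8s\.io/queue-name:[[:space:]]*[\"']?[^[:space:]\"']"
}
```

For example, `has_queue_name_label < job.yaml || echo "missing queue-name label"`, where `job.yaml` is a placeholder file name.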
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Team Setup | ||
|
||
A *team* in MLBatch is a group of users that share a resource quota. | ||
|
||
Setting up a new team requires the cluster admin to create a namespace, | ||
a quota, a queue, and the required role bindings as described below. | ||
|
||
Create namespace: | ||
```sh | ||
kubectl create namespace team1 | ||
``` | ||
|
||
For each user on the team, create a RoleBinding: | ||
```sh | ||
kubectl -n team1 apply -f- << EOF | ||
kind: RoleBinding | ||
apiVersion: rbac.authorization.k8s.io/v1 | ||
metadata: | ||
  name: user-one | ||
subjects: | ||
  - kind: User | ||
    apiGroup: rbac.authorization.k8s.io | ||
    name: user-one | ||
roleRef: | ||
  apiGroup: rbac.authorization.k8s.io | ||
  kind: ClusterRole | ||
  name: mlbatch-edit | ||
EOF | ||
``` | ||
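Because one RoleBinding is needed per user, the manifest can be generated from a small template. The sketch below is illustrative only; `rolebinding_manifest` is a hypothetical helper name, and the namespace is written into the manifest instead of being passed with `-n`.

```sh
# Sketch only: rolebinding_manifest is a hypothetical helper name.
# rolebinding_manifest NAMESPACE USER
# Prints a RoleBinding that grants USER the mlbatch-edit ClusterRole
# within NAMESPACE, followed by a YAML document separator.
rolebinding_manifest() {
    cat <<EOF
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: $2
  namespace: $1
subjects:
  - kind: User
    apiGroup: rbac.authorization.k8s.io
    name: $2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mlbatch-edit
---
EOF
}
```

A possible use is `for user in user-one user-two; do rolebinding_manifest team1 "$user"; done | kubectl apply -f-`.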
|
||
Specify the intended quota for the namespace by creating a `ClusterQueue`: | ||
```sh | ||
kubectl apply -f- << EOF | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: ClusterQueue | ||
metadata: | ||
  name: team1-cluster-queue | ||
spec: | ||
  namespaceSelector: {} | ||
  cohort: default-cohort | ||
  preemption: | ||
    withinClusterQueue: LowerOrNewerEqualPriority | ||
    reclaimWithinCohort: Any | ||
    borrowWithinCohort: | ||
      policy: Never | ||
  resourceGroups: | ||
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"] | ||
    flavors: | ||
    - name: default-flavor | ||
      resources: | ||
      - name: "cpu" | ||
        nominalQuota: 8000m | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "memory" | ||
        nominalQuota: 128Gi | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "nvidia.com/gpu" | ||
        nominalQuota: 16 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "nvidia.com/roce_gdr" | ||
        nominalQuota: 4 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
      - name: "pods" | ||
        nominalQuota: 100 | ||
        # borrowingLimit: 0 | ||
        # lendingLimit: 0 | ||
EOF | ||
``` | ||
Edit the above quantities to adjust the quota to the desired values. Pod counts | ||
are optional and can be omitted from the list of covered resources. | ||
|
||
Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing | ||
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other | ||
namespaces from borrowing quota from this namespace. | ||
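Uncommenting the limits by hand is error-prone when there are many resources. As a minimal sketch, assuming the `ClusterQueue` manifest is saved to a file, the following filters uncomment the matching lines while preserving indentation. The helper names `disable_borrowing` and `disable_lending` are hypothetical.

```sh
# Sketch only: these helper names are hypothetical.
# Each filter reads a ClusterQueue manifest on stdin and uncomments the
# matching "# <field>: 0" lines, preserving their indentation.
disable_borrowing() {
    sed 's/^\([[:space:]]*\)#[[:space:]]*\(borrowingLimit:\)/\1\2/'
}

disable_lending() {
    sed 's/^\([[:space:]]*\)#[[:space:]]*\(lendingLimit:\)/\1\2/'
}
```

For example, `disable_borrowing < team1-cluster-queue.yaml | disable_lending | kubectl apply -f-` would apply a fully isolated quota, where `team1-cluster-queue.yaml` is a placeholder file name.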
|
||
Create a `LocalQueue` to bind the `ClusterQueue` to the namespace: | ||
```sh | ||
kubectl apply -n team1 -f- << EOF | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: LocalQueue | ||
metadata: | ||
  name: default-queue | ||
spec: | ||
  clusterQueue: team1-cluster-queue | ||
EOF | ||
``` | ||
We recommend naming the local queue `default-queue` as `AppWrappers` will | ||
default to this queue name. | ||
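Workloads other than AppWrappers need the queue name set explicitly through the `kueue.x-k8s.io/queue-name` label. The sketch below prints a minimal batch `Job` carrying that label. Everything in it beyond the label key and the `default-queue` name is an illustrative assumption: the helper name `queued_job_manifest`, the container image, the command, and the resource requests.

```sh
# Sketch only: queued_job_manifest, the image, the command, and the resource
# requests are illustrative assumptions.
# queued_job_manifest NAME [QUEUE]
# Prints a minimal Job labeled for the given LocalQueue (default-queue
# when QUEUE is omitted).
queued_job_manifest() {
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: $1
  labels:
    kueue.x-k8s.io/queue-name: ${2:-default-queue}
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox
          command: ["sh", "-c", "echo hello"]
          resources:
            requests:
              cpu: 100m
              memory: 64Mi
EOF
}
```

A team member could then submit it with `queued_job_manifest sample-job | kubectl apply -n team1 -f-`.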
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Uninstall | ||
|
||
***First, remove all team namespaces and corresponding cluster queues.*** | ||
|
||
Then to uninstall the MLBatch controllers and reclaim the corresponding | ||
namespaces, do the following: | ||
```sh | ||
# Delete operators and CRDs | ||
kubectl delete -k setup.k8s-v1.30/appwrapper | ||
kubectl delete -k setup.k8s-v1.30/kueue | ||
kubectl delete -k setup.k8s-v1.30/kuberay | ||
kubectl delete -k setup.k8s-v1.30/training-operator | ||
|
||
# Delete namespace | ||
kubectl delete namespace mlbatch-system | ||
|
||
# Delete clusterrole | ||
kubectl delete clusterrole mlbatch-edit | ||
|
||
# Coscheduler uninstall | ||
helm uninstall -n scheduler-plugins scheduler-plugins | ||
kubectl delete namespace scheduler-plugins | ||
``` |
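The first step, removing team namespaces and cluster queues, has no commands listed. A minimal sketch for one team follows. It assumes the `<team>-cluster-queue` naming used in the team setup example, and `team_teardown_commands` is a hypothetical helper that only prints the commands so they can be reviewed first.

```sh
# Sketch only: team_teardown_commands is a hypothetical helper, and the
# "<team>-cluster-queue" naming is an assumption taken from the team setup
# example.
# Prints (does not run) the commands that remove one team. Deleting the
# namespace also removes the namespaced LocalQueue and RoleBindings.
team_teardown_commands() {
    echo "kubectl delete namespace $1"
    echo "kubectl delete clusterqueue $1-cluster-queue"
}
```

After reviewing the output, the commands can be executed with `team_teardown_commands team1 | sh`.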
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
apiVersion: admissionregistration.k8s.io/v1 | ||
kind: ValidatingAdmissionPolicy | ||
metadata: | ||
  name: mlbatch-require-queue-name | ||
spec: | ||
  failurePolicy: Fail | ||
  matchConstraints: | ||
    resourceRules: | ||
    - apiGroups: ["batch"] | ||
      apiVersions: ["v1"] | ||
      resources: ["jobs"] | ||
      operations: ["CREATE", "UPDATE"] | ||
    - apiGroups: ["kubeflow.org"] | ||
      apiVersions: ["v1"] | ||
      operations: ["CREATE", "UPDATE"] | ||
      resources: ["pytorchjobs"] | ||
    - apiGroups: ["ray.io"] | ||
      apiVersions: ["v1"] | ||
      operations: ["CREATE", "UPDATE"] | ||
      resources: ["rayjobs","rayclusters"] | ||
  validations: | ||
    - expression: "'kueue.x-k8s.io/queue-name' in object.metadata.labels && object.metadata.labels['kueue.x-k8s.io/queue-name'] != ''" | ||
      message: "The label 'kueue.x-k8s.io/queue-name' is either missing or does not have a value set." |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
kind: ConfigMap | ||
apiVersion: v1 | ||
metadata: | ||
  name: appwrapper-operator-config | ||
  namespace: appwrapper-system | ||
data: | ||
  config.yaml: | | ||
    appwrapper: | ||
      enableKueueIntegrations: true | ||
      kueueJobReconciller: | ||
        manageJobsWithoutQueueName: false | ||
        waitForPodsReady: | ||
          enable: false | ||
      defaultQueueName: default-queue | ||
      schedulerName: scheduler-plugins-scheduler | ||
      userRBACAdmissionCheck: false | ||
    controllerManager: | ||
      health: | ||
        bindAddress: ":8081" | ||
      metrics: | ||
        bindAddress: "127.0.0.1:8080" | ||
      leaderElection: true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
kind: Kustomization | ||
|
||
namespace: mlbatch-system | ||
|
||
resources: | ||
- "https://github.com/project-codeflare/appwrapper/config/default?ref=v0.21.1" | ||
|
||
labels: | ||
- pairs: | ||
    app.kubernetes.io/name: appwrapper | ||
    app.kubernetes.io/component: controller | ||
  includeSelectors: true | ||
|
||
images: | ||
- name: quay.io/ibm/appwrapper | ||
  newTag: v0.21.1 | ||
|
||
patches: | ||
- path: config_patch.yaml | ||
- path: manager_resources_patch.yaml | ||
- path: remove_default_namespace.yaml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
apiVersion: apps/v1 | ||
kind: Deployment | ||
metadata: | ||
  name: controller-manager | ||
  namespace: system | ||
spec: | ||
  template: | ||
    spec: | ||
      priorityClassName: system-node-critical | ||
      containers: | ||
      - name: manager | ||
        resources: | ||
          requests: | ||
            cpu: 250m | ||
            memory: 250Mi | ||
          limits: | ||
            cpu: 1000m | ||
            memory: 1000Mi |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
$patch: delete | ||
apiVersion: v1 | ||
kind: Namespace | ||
metadata: | ||
  name: appwrapper-system |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
- op: add | ||
  path: /spec/template/spec/priorityClassName | ||
  value: system-node-critical |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
apiVersion: kueue.x-k8s.io/v1beta1 | ||
kind: ResourceFlavor | ||
metadata: | ||
  name: default-flavor |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
kind: Kustomization | ||
|
||
namespace: mlbatch-system | ||
|
||
resources: | ||
- "https://github.com/ray-project/kuberay/ray-operator/config/default?ref=v1.1.0" | ||
|
||
labels: | ||
- pairs: | ||
    app.kubernetes.io/name: kuberay | ||
    app.kubernetes.io/component: controller | ||
  includeSelectors: true | ||
|
||
patches: | ||
- path: remove_default_namespace.yaml | ||
- path: manager_resources_patch.yaml |