diff --git a/setup.RHOAI-v2.10/CLUSTER-SETUP.md b/setup.RHOAI-v2.10/CLUSTER-SETUP.md
index a8f32e0..dc60c28 100644
--- a/setup.RHOAI-v2.10/CLUSTER-SETUP.md
+++ b/setup.RHOAI-v2.10/CLUSTER-SETUP.md
@@ -82,7 +82,7 @@ as follows:
 - pod priorities, resource requests and limits have been adjusted.
 
 To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
-in OpenShift AI 2.10 installation), do a rolling restart of the Kueue manager.
+in OpenShift AI installations), do a rolling restart of the Kueue manager.
 ```sh
 oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
 ```
diff --git a/setup.RHOAI-v2.11/CLUSTER-SETUP.md b/setup.RHOAI-v2.11/CLUSTER-SETUP.md
index 267a9c7..e0e3c35 100644
--- a/setup.RHOAI-v2.11/CLUSTER-SETUP.md
+++ b/setup.RHOAI-v2.11/CLUSTER-SETUP.md
@@ -82,7 +82,7 @@ as follows:
 - pod priorities, resource requests and limits have been adjusted.
 
 To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
-in OpenShift AI 2.10 installation), do a rolling restart of the Kueue manager.
+in OpenShift AI installations), do a rolling restart of the Kueue manager.
 ```sh
 oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
 ```
diff --git a/setup.k8s-v1.25/appwrapper/kustomization.yaml b/setup.k8s-v1.25/appwrapper/kustomization.yaml
index e36fe72..269e930 100644
--- a/setup.k8s-v1.25/appwrapper/kustomization.yaml
+++ b/setup.k8s-v1.25/appwrapper/kustomization.yaml
@@ -4,7 +4,7 @@ kind: Kustomization
 namespace: mlbatch-system
 
 resources:
-- "https://github.com/project-codeflare/appwrapper/config/default?ref=v0.21.0"
+- "https://github.com/project-codeflare/appwrapper/config/default?ref=v0.21.1"
 
 labels:
 - pairs:
@@ -14,7 +14,7 @@ labels:
 
 images:
 - name: quay.io/ibm/appwrapper
-  newTag: v0.21.0
+  newTag: v0.21.1
 
 patches:
 - path: config_patch.yaml
diff --git a/setup.k8s-v1.30/CLUSTER-SETUP.md b/setup.k8s-v1.30/CLUSTER-SETUP.md
new file mode 100644
index 0000000..dce17d8
--- /dev/null
+++ b/setup.k8s-v1.30/CLUSTER-SETUP.md
@@ -0,0 +1,105 @@
+# Cluster Setup
+
+The cluster setup installs and configures the following components:
++ Coscheduler
++ Kubeflow Training Operator
++ KubeRay
++ Kueue
++ AppWrappers
++ Cluster roles and priority classes
+
+If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
+[MCAD](https://github.com/project-codeflare/mcad), or Coscheduler,
+make sure to scrub traces of these installations. In particular, make sure to
+delete the following custom resource definitions (CRDs) if present on the
+cluster. 
Make sure to delete all instances prior to deleting the CRDs: +```sh +# Delete old appwrappers and crd +kubectl delete appwrappers --all -A +kubectl delete crd appwrappers.workload.codeflare.dev + +# Delete old noderesourcetopologies and crd +kubectl delete noderesourcetopologies --all -A +kubectl delete crd noderesourcetopologies.topology.node.k8s.io +``` + +## Priorities + +Create `default-priority`, `high-priority`, and `low-priority` priority classes: +```sh +kubectl apply -f setup.k8s-v1.30/mlbatch-priorities.yaml +``` + +## Coscheduler + +Install Coscheduler v0.28.9 as a secondary scheduler and configure packing: +```sh +helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \ + scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \ + --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]' +``` +Patch Coscheduler pod priorities: +```sh +kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-controller +kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-scheduler +``` + +## Install Operators + +Create the mlbatch-system namespace +```sh +kubectl create namespace mlbatch-system +``` + +Install the Kubeflow Training Operator +```sh +kubectl apply --server-side -k setup.k8s-v1.30/training-operator +``` + +Install the KubeRay Operator +```sh +kubectl apply --server-side -k setup.k8s-v1.30/kuberay +``` + +Install Kueue +```sh +kubectl apply --server-side -k setup.k8s-v1.30/kueue +``` + +Install the AppWrapper Operator +```sh +kubectl apply --server-side -k setup.k8s-v1.30/appwrapper +``` +The provided configuration differs from the default configuration of the +operators as follows: +- Kubeflow Training Operator: + - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`, +- Kueue: + - `waitForPodsReady` is disabled, +- AppWrapper operator: + - `userRBACAdmissionCheck` is disabled, + - `schedulerName` is set to `scheduler-plugins-scheduler`, + - `queueName` is set to `default-queue`, +- pod priorities, resource requests and limits have been adjusted. + +## Kueue Configuration + +Create Kueue's default flavor: +```sh +kubectl apply -f setup.k8s-v1.30/default-flavor.yaml +``` + +## Cluster Role + +Create `mlbatch-edit` role: +```sh +kubectl apply -f setup.k8s-v1.30/mlbatch-edit-role.yaml +``` +## Validating Admission Policy + +Create a validating admission policy that works with the mlbatch-edit role to +ensure that all pod-creating resources created in team namespaces will be properly +tracked for quota usage. +```sh +kubectl apply -f setup.k8s-v1.30/admission-policy.yaml +``` diff --git a/setup.k8s-v1.30/TEAM-SETUP.md b/setup.k8s-v1.30/TEAM-SETUP.md new file mode 100644 index 0000000..f9620d8 --- /dev/null +++ b/setup.k8s-v1.30/TEAM-SETUP.md @@ -0,0 +1,93 @@ +# Team Setup + +A *team* in MLBatch is a group of users that share a resource quota. + +Setting up a new team requires the cluster admin to create a namespace, +a quota, a queue, and the required role bindings as described below. 
+
+Create namespace:
+```sh
+kubectl create namespace team1
+```
+
+For each user on the team, create a RoleBinding:
+```sh
+kubectl -n team1 apply -f- << EOF
+kind: RoleBinding
+apiVersion: rbac.authorization.k8s.io/v1
+metadata:
+  name: user-one
+subjects:
+  - kind: User
+    apiGroup: rbac.authorization.k8s.io
+    name: user-one
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: mlbatch-edit
+EOF
+```
+
+Specify the intended quota for the namespace by creating a `ClusterQueue`:
+```sh
+kubectl apply -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: ClusterQueue
+metadata:
+  name: team1-cluster-queue
+spec:
+  namespaceSelector: {}
+  cohort: default-cohort
+  preemption:
+    withinClusterQueue: LowerOrNewerEqualPriority
+    reclaimWithinCohort: Any
+    borrowWithinCohort:
+      policy: Never
+  resourceGroups:
+  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
+    flavors:
+    - name: default-flavor
+      resources:
+      - name: "cpu"
+        nominalQuota: 8000m
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "memory"
+        nominalQuota: 128Gi
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/gpu"
+        nominalQuota: 16
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "nvidia.com/roce_gdr"
+        nominalQuota: 4
+        # borrowingLimit: 0
+        # lendingLimit: 0
+      - name: "pods"
+        nominalQuota: 100
+        # borrowingLimit: 0
+        # lendingLimit: 0
+EOF
+```
+Edit the above quantities to adjust the quota to the desired values. Pod counts
+are optional and can be omitted from the list of covered resources.
+
+Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
+quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
+namespaces from borrowing quota from this namespace.
+
+Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
+```sh
+kubectl apply -n team1 -f- << EOF
+apiVersion: kueue.x-k8s.io/v1beta1
+kind: LocalQueue
+metadata:
+  name: default-queue
+spec:
+  clusterQueue: team1-cluster-queue
+EOF
+```
+We recommend naming the local queue `default-queue` as `AppWrappers` will
+default to this queue name. 
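+
+To verify the team setup (an optional sanity check; a minimal sketch assuming
+the `team1` and queue names used above), confirm that both queues were created:
+```sh
+kubectl get clusterqueue team1-cluster-queue
+kubectl -n team1 get localqueue default-queue
+```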
+
diff --git a/setup.k8s-v1.30/UNINSTALL.md b/setup.k8s-v1.30/UNINSTALL.md
new file mode 100644
index 0000000..f19608f
--- /dev/null
+++ b/setup.k8s-v1.30/UNINSTALL.md
@@ -0,0 +1,23 @@
+# Uninstall
+
+***First, remove all team namespaces and corresponding cluster queues.***
+
+Then to uninstall the MLBatch controllers and reclaim the corresponding
+namespaces, do the following:
+```sh
+# Delete operators and CRDs
+kubectl delete -k setup.k8s-v1.30/appwrapper
+kubectl delete -k setup.k8s-v1.30/kueue
+kubectl delete -k setup.k8s-v1.30/kuberay
+kubectl delete -k setup.k8s-v1.30/training-operator
+
+# Delete namespace
+kubectl delete namespace mlbatch-system
+
+# Delete clusterrole
+kubectl delete clusterrole mlbatch-edit
+
+# Coscheduler uninstall
+helm uninstall -n scheduler-plugins scheduler-plugins
+kubectl delete namespace scheduler-plugins
+```
diff --git a/setup.k8s-v1.30/admission-policy.yaml b/setup.k8s-v1.30/admission-policy.yaml
new file mode 100644
index 0000000..babab5a
--- /dev/null
+++ b/setup.k8s-v1.30/admission-policy.yaml
@@ -0,0 +1,23 @@
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingAdmissionPolicy
+metadata:
+  name: mlbatch-require-queue-name
+spec:
+  failurePolicy: Fail
+  matchConstraints:
+    resourceRules:
+    - apiGroups: ["batch"]
+      apiVersions: ["v1"]
+      resources: ["jobs"]
+      operations: ["CREATE", "UPDATE"]
+    - apiGroups: ["kubeflow.org"]
+      apiVersions: ["v1"]
+      operations: ["CREATE", "UPDATE"]
+      resources: ["pytorchjobs"]
+    - apiGroups: ["ray.io"]
+      apiVersions: ["v1"]
+      operations: ["CREATE", "UPDATE"]
+      resources: ["rayjobs","rayclusters"]
+  validations:
+  - expression: "'kueue.x-k8s.io/queue-name' in object.metadata.labels && object.metadata.labels['kueue.x-k8s.io/queue-name'] != ''"
+    message: "The label 'kueue.x-k8s.io/queue-name' is either missing or does not have a value set." 
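Note that a `ValidatingAdmissionPolicy` only takes effect once it is referenced
by a `ValidatingAdmissionPolicyBinding`. A minimal sketch of such a binding (not
part of this diff; the `mlbatch-team-namespace` label used to select team
namespaces is illustrative):
```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: mlbatch-require-queue-name
spec:
  policyName: mlbatch-require-queue-name
  validationActions: [Deny]
  matchResources:
    namespaceSelector:
      matchLabels:
        mlbatch-team-namespace: "true"  # illustrative label; adjust to however team namespaces are marked
```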
diff --git a/setup.k8s-v1.30/appwrapper/config_patch.yaml b/setup.k8s-v1.30/appwrapper/config_patch.yaml new file mode 100644 index 0000000..2c3cd85 --- /dev/null +++ b/setup.k8s-v1.30/appwrapper/config_patch.yaml @@ -0,0 +1,22 @@ +kind: ConfigMap +apiVersion: v1 +metadata: + name: appwrapper-operator-config + namespace: appwrapper-system +data: + config.yaml: | + appwrapper: + enableKueueIntegrations: true + kueueJobReconciller: + manageJobsWithoutQueueName: false + waitForPodsReady: + enable: false + defaultQueueName: default-queue + schedulerName: scheduler-plugins-scheduler + userRBACAdmissionCheck: false + controllerManager: + health: + bindAddress: ":8081" + metrics: + bindAddress: "127.0.0.1:8080" + leaderElection: true diff --git a/setup.k8s-v1.30/appwrapper/kustomization.yaml b/setup.k8s-v1.30/appwrapper/kustomization.yaml new file mode 100644 index 0000000..269e930 --- /dev/null +++ b/setup.k8s-v1.30/appwrapper/kustomization.yaml @@ -0,0 +1,22 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: mlbatch-system + +resources: +- "https://github.com/project-codeflare/appwrapper/config/default?ref=v0.21.1" + +labels: +- pairs: + app.kubernetes.io/name: appwrapper + app.kubernetes.io/component: controller + includeSelectors: true + +images: +- name: quay.io/ibm/appwrapper + newTag: v0.21.1 + +patches: +- path: config_patch.yaml +- path: manager_resources_patch.yaml +- path: remove_default_namespace.yaml diff --git a/setup.k8s-v1.30/appwrapper/manager_resources_patch.yaml b/setup.k8s-v1.30/appwrapper/manager_resources_patch.yaml new file mode 100644 index 0000000..1b26c3c --- /dev/null +++ b/setup.k8s-v1.30/appwrapper/manager_resources_patch.yaml @@ -0,0 +1,18 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: controller-manager + namespace: system +spec: + template: + spec: + priorityClassName: system-node-critical + containers: + - name: manager + resources: + requests: + cpu: 250m + memory: 250Mi + limits: + cpu: 1000m + memory: 1000Mi diff --git a/setup.k8s-v1.30/appwrapper/remove_default_namespace.yaml b/setup.k8s-v1.30/appwrapper/remove_default_namespace.yaml new file mode 100644 index 0000000..b63fb95 --- /dev/null +++ b/setup.k8s-v1.30/appwrapper/remove_default_namespace.yaml @@ -0,0 +1,5 @@ +$patch: delete +apiVersion: v1 +kind: Namespace +metadata: + name: appwrapper-system diff --git a/setup.k8s-v1.30/coscheduler-priority-patch.yaml b/setup.k8s-v1.30/coscheduler-priority-patch.yaml new file mode 100644 index 0000000..278802f --- /dev/null +++ b/setup.k8s-v1.30/coscheduler-priority-patch.yaml @@ -0,0 +1,3 @@ +- op: add + path: /spec/template/spec/priorityClassName + value: system-node-critical diff --git a/setup.k8s-v1.30/default-flavor.yaml b/setup.k8s-v1.30/default-flavor.yaml new file mode 100644 index 0000000..6cbccf3 --- /dev/null +++ b/setup.k8s-v1.30/default-flavor.yaml @@ -0,0 +1,4 @@ +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: default-flavor diff --git a/setup.k8s-v1.30/kuberay/kustomization.yaml b/setup.k8s-v1.30/kuberay/kustomization.yaml new file mode 100644 index 0000000..0161395 --- /dev/null +++ b/setup.k8s-v1.30/kuberay/kustomization.yaml @@ -0,0 +1,17 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: mlbatch-system + +resources: +- "https://github.com/ray-project/kuberay/ray-operator/config/default?ref=v1.1.0" + +labels: +- pairs: + app.kubernetes.io/name: kuberay + app.kubernetes.io/component: controller + includeSelectors: true + +patches: +- path: 
remove_default_namespace.yaml +- path: manager_resources_patch.yaml diff --git a/setup.k8s-v1.30/kuberay/manager_resources_patch.yaml b/setup.k8s-v1.30/kuberay/manager_resources_patch.yaml new file mode 100644 index 0000000..7bb80d9 --- /dev/null +++ b/setup.k8s-v1.30/kuberay/manager_resources_patch.yaml @@ -0,0 +1,20 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: kuberay-operator + namespace: system +spec: + template: + spec: + priorityClassName: system-node-critical + containers: + - name: kuberay-operator + args: + - "--zap-log-level=2" + resources: + requests: + cpu: 100m + memory: 100Mi + limits: + cpu: 500m + memory: 1000Mi diff --git a/setup.k8s-v1.30/kuberay/remove_default_namespace.yaml b/setup.k8s-v1.30/kuberay/remove_default_namespace.yaml new file mode 100644 index 0000000..b5977cc --- /dev/null +++ b/setup.k8s-v1.30/kuberay/remove_default_namespace.yaml @@ -0,0 +1,5 @@ +$patch: delete +apiVersion: v1 +kind: Namespace +metadata: + name: ray-system diff --git a/setup.k8s-v1.30/kueue/controller_manager_config.yaml b/setup.k8s-v1.30/kueue/controller_manager_config.yaml new file mode 100644 index 0000000..0e90387 --- /dev/null +++ b/setup.k8s-v1.30/kueue/controller_manager_config.yaml @@ -0,0 +1,64 @@ +apiVersion: config.kueue.x-k8s.io/v1beta1 +kind: Configuration +health: + healthProbeBindAddress: :8081 +metrics: + bindAddress: :8080 +# enableClusterQueueResources: true +webhook: + port: 9443 +leaderElection: + leaderElect: true + resourceName: c1f6bfd2.kueue.x-k8s.io +controller: + groupKindConcurrency: + Job.batch: 5 + Pod: 5 + Workload.kueue.x-k8s.io: 5 + LocalQueue.kueue.x-k8s.io: 1 + ClusterQueue.kueue.x-k8s.io: 1 + ResourceFlavor.kueue.x-k8s.io: 1 +clientConnection: + qps: 50 + burst: 100 +#pprofBindAddress: :8083 +waitForPodsReady: + enable: false +# timeout: 5m +# blockAdmission: false +# requeuingStrategy: +# timestamp: Eviction +# backoffLimitCount: null # null indicates infinite requeuing +# backoffBaseSeconds: 60 +# backoffMaxSeconds: 3600 +#manageJobsWithoutQueueName: false +#internalCertManagement: +# enable: false +# webhookServiceName: "" +# webhookSecretName: "" +integrations: + frameworks: + - "batch/job" + - "kubeflow.org/mpijob" + - "ray.io/rayjob" + - "ray.io/raycluster" + - "jobset.x-k8s.io/jobset" + - "kubeflow.org/mxjob" + - "kubeflow.org/paddlejob" + - "kubeflow.org/pytorchjob" + - "kubeflow.org/tfjob" + - "kubeflow.org/xgboostjob" + # - "pod" + externalFrameworks: + - "AppWrapper.v1beta2.workload.codeflare.dev" +# podOptions: +# namespaceSelector: +# matchExpressions: +# - key: kubernetes.io/metadata.name +# operator: NotIn +# values: [ kube-system, kueue-system ] +#fairSharing: +# enable: true +# preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare] +#resources: +# excludeResourcePrefixes: [] diff --git a/setup.k8s-v1.30/kueue/kustomization.yaml b/setup.k8s-v1.30/kueue/kustomization.yaml new file mode 100644 index 0000000..e9d3842 --- /dev/null +++ b/setup.k8s-v1.30/kueue/kustomization.yaml @@ -0,0 +1,46 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: mlbatch-system + +resources: +- "https://github.com/kubernetes-sigs/kueue/config/default?ref=v0.7.1" + +labels: +- pairs: + app.kubernetes.io/name: kueue + app.kubernetes.io/component: controller + includeSelectors: true + +configMapGenerator: +- name: manager-config + namespace: kueue-system + behavior: replace + files: + - controller_manager_config.yaml + +images: +- name: gcr.io/k8s-staging-kueue/kueue + newName: 
registry.k8s.io/kueue/kueue
+  newTag: v0.7.1
+
+patches:
+- path: manager_resources_patch.yaml
+- path: mutating_webhook_patch.yaml
+- path: remove_default_namespace.yaml
+- path: validating_webhook_patch.yaml
+- target:
+    kind: ClusterRole
+    name: manager-role
+  patch: |
+    - op: add
+      path: /rules/-
+      value:
+        apiGroups:
+        - workload.codeflare.dev
+        resources:
+        - appwrappers
+        verbs:
+        - get
+        - list
+        - watch
diff --git a/setup.k8s-v1.30/kueue/manager_resources_patch.yaml b/setup.k8s-v1.30/kueue/manager_resources_patch.yaml
new file mode 100644
index 0000000..5dc7501
--- /dev/null
+++ b/setup.k8s-v1.30/kueue/manager_resources_patch.yaml
@@ -0,0 +1,9 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: controller-manager
+  namespace: system
+spec:
+  template:
+    spec:
+      priorityClassName: system-node-critical
diff --git a/setup.k8s-v1.30/kueue/mutating_webhook_patch.yaml b/setup.k8s-v1.30/kueue/mutating_webhook_patch.yaml
new file mode 100644
index 0000000..61d0e1d
--- /dev/null
+++ b/setup.k8s-v1.30/kueue/mutating_webhook_patch.yaml
@@ -0,0 +1,9 @@
+apiVersion: admissionregistration.k8s.io/v1
+kind: MutatingWebhookConfiguration
+metadata:
+  name: mutating-webhook-configuration
+webhooks:
+  - $patch: delete
+    name: mpod.kb.io
+  - $patch: delete
+    name: mjob.kb.io
diff --git a/setup.k8s-v1.30/kueue/remove_default_namespace.yaml b/setup.k8s-v1.30/kueue/remove_default_namespace.yaml
new file mode 100644
index 0000000..787ee88
--- /dev/null
+++ b/setup.k8s-v1.30/kueue/remove_default_namespace.yaml
@@ -0,0 +1,5 @@
+$patch: delete
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: kueue-system
diff --git a/setup.k8s-v1.30/kueue/validating_webhook_patch.yaml b/setup.k8s-v1.30/kueue/validating_webhook_patch.yaml
new file mode 100644
index 0000000..711b05d
--- /dev/null
+++ b/setup.k8s-v1.30/kueue/validating_webhook_patch.yaml
@@ -0,0 +1,7 @@
+apiVersion: admissionregistration.k8s.io/v1
+kind: ValidatingWebhookConfiguration
+metadata:
+  name: validating-webhook-configuration
+webhooks:
+  - $patch: delete
+    name: vpod.kb.io
diff --git a/setup.k8s-v1.30/mlbatch-edit-role.yaml b/setup.k8s-v1.30/mlbatch-edit-role.yaml
new file mode 100644
index 0000000..5e279c6
--- /dev/null
+++ b/setup.k8s-v1.30/mlbatch-edit-role.yaml
@@ -0,0 +1,112 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: mlbatch-edit
+rules:
+- apiGroups:
+  - ""
+  resources:
+  - pods
+  verbs:
+  - delete
+  - get
+  - list
+  - watch
+- apiGroups:
+  - apps
+  resources:
+  - deployments
+  - statefulsets
+  verbs:
+  - delete
+  - get
+  - list
+  - watch
+- apiGroups:
+  - ""
+  resources:
+  - services
+  - secrets
+  - configmaps
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - kueue.x-k8s.io
+  resources:
+  - "*"
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - kubeflow.org
+  resources:
+  - pytorchjobs
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - ray.io
+  resources:
+  - rayjobs
+  - rayclusters
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - batch
+  resources:
+  - jobs
+  verbs:
+  - delete
+  - get
+  - list
+  - watch
+- apiGroups:
+  - workload.codeflare.dev
+  resources:
+  - appwrappers
+  verbs:
+  - create
+  - delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
+- apiGroups:
+  - scheduling.k8s.io
+  resources:
+  - priorityclasses
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
+  - scheduling.x-k8s.io
+  resources:
+  - podgroups
+  verbs:
+  - create
+  - 
delete
+  - get
+  - list
+  - patch
+  - update
+  - watch
diff --git a/setup.k8s-v1.30/mlbatch-priorities.yaml b/setup.k8s-v1.30/mlbatch-priorities.yaml
new file mode 100644
index 0000000..77c8f3b
--- /dev/null
+++ b/setup.k8s-v1.30/mlbatch-priorities.yaml
@@ -0,0 +1,26 @@
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: low-priority
+value: 1
+preemptionPolicy: PreemptLowerPriority
+globalDefault: false
+description: "This is the priority class for all lower priority jobs."
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: default-priority
+value: 5
+preemptionPolicy: PreemptLowerPriority
+globalDefault: true
+description: "This is the priority class for all jobs (default priority)."
+---
+apiVersion: scheduling.k8s.io/v1
+kind: PriorityClass
+metadata:
+  name: high-priority
+value: 10
+preemptionPolicy: PreemptLowerPriority
+globalDefault: false
+description: "This is the priority class defined for highly important jobs that would evict lower and default priority jobs."
diff --git a/setup.k8s-v1.30/training-operator/kustomization.yaml b/setup.k8s-v1.30/training-operator/kustomization.yaml
new file mode 100644
index 0000000..6aa6dc2
--- /dev/null
+++ b/setup.k8s-v1.30/training-operator/kustomization.yaml
@@ -0,0 +1,19 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: mlbatch-system
+
+resources:
+- "https://github.com/kubeflow/training-operator/manifests/base?ref=v1.7.0"
+
+labels:
+- pairs:
+    app.kubernetes.io/name: training-operator
+    app.kubernetes.io/component: controller
+  includeSelectors: true
+
+images:
+- name: kubeflow/training-operator
+  newTag: "v1-855e096"
+
+patches:
+- path: manager_resources_patch.yaml
diff --git a/setup.k8s-v1.30/training-operator/manager_resources_patch.yaml b/setup.k8s-v1.30/training-operator/manager_resources_patch.yaml
new file mode 100644
index 0000000..5bc1f6d
--- /dev/null
+++ b/setup.k8s-v1.30/training-operator/manager_resources_patch.yaml
@@ -0,0 +1,20 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: training-operator
+spec:
+  template:
+    spec:
+      priorityClassName: system-node-critical
+      containers:
+      - name: training-operator
+        args:
+        - "--zap-log-level=2"
+        - "--gang-scheduler-name=scheduler-plugins-scheduler"
+        resources:
+          requests:
+            cpu: 100m
+            memory: 100Mi
+          limits:
+            cpu: 500m
+            memory: 1000Mi
diff --git a/setup.tmpl/CLUSTER-SETUP.md.tmpl b/setup.tmpl/CLUSTER-SETUP.md.tmpl
index 0a4c211..cba16fe 100644
--- a/setup.tmpl/CLUSTER-SETUP.md.tmpl
+++ b/setup.tmpl/CLUSTER-SETUP.md.tmpl
@@ -108,7 +108,7 @@ as follows:
 - pod priorities, resource requests and limits have been adjusted.
 
 To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
-in OpenShift AI 2.10 installation), do a rolling restart of the Kueue manager.
+in OpenShift AI installations), do a rolling restart of the Kueue manager. 
```sh {{ .KUBECTL }} rollout restart deployment/kueue-controller-manager -n redhat-ods-applications ``` @@ -153,8 +153,10 @@ operators as follows: - Kubeflow Training Operator: - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`, - Kueue: +{{- if not .VAP }} - `manageJobsWithoutQueueName` is enabled, - `batch/job` integration is disabled, +{{- end }} - `waitForPodsReady` is disabled, - AppWrapper operator: - `userRBACAdmissionCheck` is disabled, @@ -177,3 +179,14 @@ Create `mlbatch-edit` role: ```sh {{ .KUBECTL }} apply -f setup.{{ .VERSION }}/mlbatch-edit-role.yaml ``` + +{{- if .VAP }} +## Validating Admission Policy + +Create a validating admission policy that works with the mlbatch-edit role to +ensure that all pod-creating resources created in team namespaces will be properly +tracked for quota usage. +```sh +{{ .KUBECTL }} apply -f setup.{{ .VERSION }}/admission-policy.yaml +``` +{{- end }} diff --git a/setup.tmpl/Kubernetes-v1.25.yaml b/setup.tmpl/Kubernetes-v1.25.yaml index f3b721e..b0be707 100644 --- a/setup.tmpl/Kubernetes-v1.25.yaml +++ b/setup.tmpl/Kubernetes-v1.25.yaml @@ -1,5 +1,6 @@ -# Values for RHOAI 2.11 +# Values for Kubernetes v1.25+ OPENSHIFT: false VERSION: k8s-v1.25 KUBECTL: kubectl +VAP: false diff --git a/setup.tmpl/Kubernetes-v1.30.yaml b/setup.tmpl/Kubernetes-v1.30.yaml new file mode 100644 index 0000000..95f5974 --- /dev/null +++ b/setup.tmpl/Kubernetes-v1.30.yaml @@ -0,0 +1,6 @@ +# Values for Kubernetes v1.30+ + +OPENSHIFT: false +VERSION: k8s-v1.30 +KUBECTL: kubectl +VAP: true diff --git a/setup.tmpl/Makefile b/setup.tmpl/Makefile index d85a5a4..8fdf1c5 100644 --- a/setup.tmpl/Makefile +++ b/setup.tmpl/Makefile @@ -27,6 +27,8 @@ docs: gotmpl ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.RHOAI-v2.11/TEAM-SETUP.md -values RHOAI-v2.11.yaml ../tools/gotmpl/gotmpl -input ./CLUSTER-SETUP.md.tmpl -output ../setup.k8s-v1.25/CLUSTER-SETUP.md -values Kubernetes-v1.25.yaml ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.k8s-v1.25/TEAM-SETUP.md -values Kubernetes-v1.25.yaml + ../tools/gotmpl/gotmpl -input ./CLUSTER-SETUP.md.tmpl -output ../setup.k8s-v1.30/CLUSTER-SETUP.md -values Kubernetes-v1.30.yaml + ../tools/gotmpl/gotmpl -input ./TEAM-SETUP.md.tmpl -output ../setup.k8s-v1.30/TEAM-SETUP.md -values Kubernetes-v1.30.yaml ##@ Dependencies
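With these targets in place, the generated setup documents for the new
`k8s-v1.30` flavor can be refreshed whenever a template or values file changes
(a usage sketch; assumes the `docs` target above and the repository's gotmpl
tooling, run from the repository root):
```sh
# Regenerate all CLUSTER-SETUP.md and TEAM-SETUP.md files from the templates
make -C setup.tmpl docs
```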