The cluster setup installs and configures the following components:
- Coscheduler
- Kubeflow Training Operator
- KubeRay
- Kueue
- AppWrappers
- Cluster roles and priority classes
Create default-priority, high-priority, and low-priority priority classes:
kubectl apply -f setup.k8s-v1.30/mlbatch-priorities.yaml
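To confirm that the three priority classes exist after applying the manifest, a check along the following lines may help. The helper name check_priority_classes is hypothetical and not part of MLBatch; it only wraps a standard kubectl get call.

```shell
# Hypothetical helper (not part of MLBatch): list the three priority classes.
# kubectl reports an error and exits non-zero if any of them is missing.
check_priority_classes() {
  kubectl get priorityclass default-priority high-priority low-priority
}
```

Run check_priority_classes against the target cluster once the apply command has completed.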
Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
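The single-line pluginConfig value is hard to read. The sketch below pretty-prints the same JSON, unchanged, so its two entries are easier to inspect: a NodeResourcesFit scoring strategy whose score rises with GPU utilization (which favors packing), and a Coscheduling permit wait of 300 seconds. It assumes python3 is available on the PATH; the variable names are illustrative only.

```shell
# The pluginConfig value from the helm command above, copied verbatim.
plugin_config='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'

# Pretty-print with Python's standard json.tool module (assumes python3 is installed).
plugin_config_pretty=$(printf '%s' "$plugin_config" | python3 -m json.tool)
printf '%s\n' "$plugin_config_pretty"
```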
Patch Coscheduler pod priorities:
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-controller
kubectl patch deployment -n scheduler-plugins --type=json --patch-file setup.k8s-v1.30/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
Create the mlbatch-system namespace:
kubectl create namespace mlbatch-system
Install the Kubeflow Training Operator:
kubectl apply --server-side -k setup.k8s-v1.30/training-operator
Install the KubeRay Operator:
kubectl apply --server-side -k setup.k8s-v1.30/kuberay
Install Kueue:
kubectl apply --server-side -k setup.k8s-v1.30/kueue
Install the AppWrapper Operator:
kubectl apply --server-side -k setup.k8s-v1.30/appwrapper
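Before continuing, it can be useful to wait until the operator deployments report as available. The following is a minimal sketch, assuming the operators are deployed into the mlbatch-system namespace created above; the helper name wait_for_operators and the 300-second timeout are illustrative choices, not part of MLBatch.

```shell
# Hypothetical helper (not part of MLBatch): block until every Deployment in the
# given namespace is Available. The default namespace, mlbatch-system, is an
# assumption about where the operators are installed.
wait_for_operators() {
  ns="${1:-mlbatch-system}"
  kubectl wait deployment --all --namespace "$ns" --for=condition=Available --timeout=300s
}
```

Call wait_for_operators with no arguments, or pass another namespace if the operators are installed elsewhere.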
The provided configuration differs from the default configuration of the operators as follows:
- Kubeflow Training Operator:
  - gang-scheduler-name is set to scheduler-plugins-scheduler,
- Kueue:
  - batch/job integration is disabled,
  - waitForPodsReady is disabled,
  - LendingLimit feature gate is enabled,
  - fairSharing is enabled,
  - enableClusterQueueResources metrics is enabled,
- AppWrapper operator:
  - userRBACAdmissionCheck is disabled,
  - schedulerName is set to scheduler-plugins-scheduler,
  - queueName is set to default-queue,
- pod priorities, resource requests and limits have been adjusted.
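As a rough illustration of the Kueue items in the list above, the sketch below prints a fragment of a Kueue Configuration object with those settings. It is not the file shipped in setup.k8s-v1.30/kueue: the frameworks shown are an assumed subset chosen for illustration, and the actual configuration may contain further fields.

```shell
# Illustrative only: a Kueue Configuration fragment reflecting the listed settings.
kueue_config_sketch=$(cat <<'EOF'
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: false                      # waitForPodsReady is disabled
fairSharing:
  enable: true                       # fairSharing is enabled
metrics:
  enableClusterQueueResources: true  # ClusterQueue resource metrics are enabled
integrations:
  frameworks:                        # "batch/job" is deliberately absent (assumed subset)
  - "kubeflow.org/pytorchjob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
# The LendingLimit feature gate is not part of this object; it is passed to the
# Kueue manager as a flag, e.g. --feature-gates=LendingLimit=true
EOF
)
printf '%s\n' "$kueue_config_sketch"
```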
Create Kueue's default flavor:
kubectl apply -f setup.k8s-v1.30/default-flavor.yaml
Create the mlbatch-edit role:
kubectl apply -f setup.k8s-v1.30/mlbatch-edit-role.yaml
Create an admission policy that requires every pod-creating resource permitted by the mlbatch-edit role and created in a team namespace to carry a local queue name, making it subject to Kueue's quota management:
kubectl apply -f setup.k8s-v1.30/admission-policy.yaml
Create the designated slack ClusterQueue, which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance:
kubectl apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The lendingLimit for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See QUOTA_MAINTENANCE.md for a detailed discussion of the
role of the slack ClusterQueue.
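MLBatch adjusts lendingLimit automatically, so no manual step is required. Purely to illustrate what such an adjustment looks like, the sketch below builds a JSON patch that sets lendingLimit on one resource entry of the slack ClusterQueue. Both helper names are hypothetical, and the resource index assumes the ordering used in the manifest above (0 = cpu, 1 = memory, 2 = nvidia.com/gpu, 3 = nvidia.com/roce_gdr, 4 = pods).

```shell
# Hypothetical helper (not part of MLBatch): emit a JSON patch that sets
# lendingLimit for one resource of the first flavor in the first resource group.
#   $1 = zero-based index of the resource in the flavor's resources list
#   $2 = new lendingLimit value
lending_limit_patch() {
  printf '[{"op":"add","path":"/spec/resourceGroups/0/flavors/0/resources/%s/lendingLimit","value":"%s"}]' "$1" "$2"
}

# Hypothetical helper: apply the patch to the slack ClusterQueue (needs cluster access).
set_slack_lending_limit() {
  kubectl patch clusterqueue slack-cluster-queue --type=json -p "$(lending_limit_patch "$1" "$2")"
}
```

For example, set_slack_lending_limit 2 6 would set the lendingLimit for nvidia.com/gpu to 6, so that at most 6 of the 8 GPUs in the nominal quota are lent to the rest of the cohort.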