The cluster setup installs Red Hat OpenShift AI and Coscheduler, and configures Kueue, cluster roles, and priority classes.
Create default-priority, high-priority, and low-priority priority classes:
oc apply -f setup.RHOAI-v2.10/mlbatch-priorities.yaml
Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
--set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'
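The pluginConfig value above is hard to read on a single line. As a reading aid, the following sketch writes the same JSON in indented form and strips the whitespace before it is handed to helm. The variable name plugin_config is an assumption introduced for illustration. The scoring shape maps 0% GPU utilization to score 0 and 100% to score 10, so nodes whose GPUs are already heavily requested are preferred (packing), and the Coscheduling plugin is given a permit waiting time of 300 seconds.

```shell
# Readable form of the pluginConfig JSON. None of the strings contain spaces,
# so removing every space and newline yields the compact one-line value.
plugin_config=$(tr -d ' \n' <<'EOF'
[
  {
    "args": {
      "scoringStrategy": {
        "resources": [{"name": "nvidia.com/gpu", "weight": 1}],
        "requestedToCapacityRatio": {
          "shape": [
            {"utilization": 0, "score": 0},
            {"utilization": 100, "score": 10}
          ]
        },
        "type": "RequestedToCapacityRatio"
      }
    },
    "name": "NodeResourcesFit"
  },
  {
    "args": {"permitWaitingTimeSeconds": 300},
    "name": "Coscheduling"
  }
]
EOF
)

# Possible usage (not executed here), equivalent to the command above:
# helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
#   scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
#   --set-json pluginConfig="$plugin_config"
```

Keeping the indented form in a script makes later changes to the weights or the waiting time easier to review.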
Patch Coscheduler pod priorities:
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.10/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.10/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
Create the Red Hat OpenShift AI subscription:
oc apply -f setup.RHOAI-v2.10/mlbatch-subscription.yaml
Identify the install plan:
oc get ip -n redhat-ods-operator
NAMESPACE NAME CSV APPROVAL APPROVED
redhat-ods-operator install-kmh8w rhods-operator.2.10.0 Manual false
Approve the install plan, replacing the generated plan name below (install-kmh8w) with the actual value:
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
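Because the plan name is generated, it can be convenient to extract it from the listing rather than copy it by hand. The following is a minimal sketch, assuming the table layout shown above; the helper name pending_install_plans is hypothetical and not part of the setup files. It locates the NAME and APPROVED columns from the header row, so it does not depend on whether a NAMESPACE column is printed.

```shell
# Print the names of install plans whose APPROVED column is "false".
# Reads the table printed by `oc get ip` on stdin.
pending_install_plans() {
  awk '
    NR == 1 {
      # Find the column positions from the header row.
      for (i = 1; i <= NF; i++) {
        if ($i == "NAME") name_col = i
        if ($i == "APPROVED") approved_col = i
      }
      next
    }
    name_col && approved_col && $approved_col == "false" { print $name_col }
  '
}

# Possible usage (not executed here):
# plan=$(oc get ip -n redhat-ods-operator | pending_install_plans | head -n 1)
# oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' "$plan"
```

If more than one plan is pending, review the list before approving anything; the usage above simply takes the first entry.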
Create DSC Initialization:
oc apply -f setup.RHOAI-v2.10/mlbatch-dsci.yaml
Create Data Science Cluster:
oc apply -f setup.RHOAI-v2.10/mlbatch-dsc.yaml
The provided DSCI and DSC are intended to install a minimal set of Red Hat OpenShift AI managed components: codeflare, kueue, ray, and trainingoperator. The remaining components such as dashboard can be optionally enabled.
The configuration of the managed components differs from the default Red Hat OpenShift AI configuration as follows:
- Kubeflow Training Operator:
  - gang-scheduler-name is set to scheduler-plugins-scheduler,
- Kueue:
  - manageJobsWithoutQueueName is enabled,
  - batch/job integration is disabled,
  - waitForPodsReady is disabled,
  - LendingLimit feature gate is enabled,
  - enableClusterQueueResources metrics is enabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - userRBACAdmissionCheck is disabled,
    - schedulerName is set to scheduler-plugins-scheduler,
    - queueName is set to default-queue,
- pod priorities, resource requests and limits have been adjusted.
To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition in Red Hat OpenShift AI installation), perform a rolling restart of the Kueue manager:
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
After the restart, verify that the following lines appear in the kueue-controller-manager log:
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}
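The check for these log lines can be scripted. The following is a minimal sketch, assuming the messages keep the field order shown above; the function name check_pytorchjob_webhook is hypothetical. It reads the log on stdin and fails if any of the three PyTorchJob messages is missing.

```shell
# Succeed only if all three PyTorchJob webhook messages appear in the log on stdin.
check_pytorchjob_webhook() {
  log=$(cat)
  for pattern in \
    '"msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob"' \
    '"msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"' \
    '"msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"'
  do
    # -F treats the pattern as a fixed string, so the dots and slashes are literal.
    if ! printf '%s\n' "$log" | grep -qF -- "$pattern"; then
      echo "missing log line: $pattern" >&2
      return 1
    fi
  done
  echo "PyTorchJob webhook registered"
}

# Possible usage (not executed here):
# oc rollout status deployment/kueue-controller-manager -n redhat-ods-applications
# oc logs deployment/kueue-controller-manager -n redhat-ods-applications | check_pytorchjob_webhook
```

A non-zero exit status suggests the race condition was hit again and the restart may need to be repeated.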
Create Kueue's default flavor:
oc apply -f setup.RHOAI-v2.10/default-flavor.yaml
Create mlbatch-edit role:
oc apply -f setup.RHOAI-v2.10/mlbatch-edit-role.yaml