diff --git a/setup.KubeConEU25/README.md b/setup.KubeConEU25/README.md index 2363f51..97a7c1e 100644 --- a/setup.KubeConEU25/README.md +++ b/setup.KubeConEU25/README.md @@ -1,15 +1,30 @@ # MLBatch Tutorial In this tutorial, we walk through all the steps necessary to setup MLBatch on a -Kubernetes cluster and run a few example workloads. Prior to the [cluster -setup](../setup.k8s/CLUSTER-SETUP.md), we will configure storage classes and -Prometheus. We will configure team `blue` with user `alice` and `red` with user -`bob` following the [team setup](../setup.k8s/TEAM-SETUP.md). +Kubernetes cluster and run a few example workloads. +- We configure persistent storage using +[NFS](https://en.wikipedia.org/wiki/Network_File_System). +- We deploy MLBatch following the + [CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md) instructions. +- We configure example teams following the + [TEAM-SETUP.md](../setup.k8s/TEAM-SETUP.md) instructions. +- We reconfigure Autopilot to periodically assess the storage class in addition + to running network and GPU tests. _This is optional._ +- We deploy [Prometheus](https://prometheus.io) and [Grafana +dashboards](https://grafana.com/grafana/dashboards/) to monitor the health of +the cluster and GPU utilization. _This is optional._ +- We demonstrate the queueing, quota management, and fault recovery capabilities + of MLBatch using synthetic workloads. +- We run example workloads using vLLM, PyTorch, and Ray. ## Cluster Characteristics Our target cluster comprises three control planes nodes and three worker nodes -running Kubernetes 1.29 (from OpenShift 4.16.36). +running Kubernetes 1.29, specifically [OpenShift +4.16](https://docs.openshift.com/container-platform/4.16/release_notes/ocp-4-16-release-notes.html). + +
+ ```sh kubectl get nodes ``` @@ -22,7 +37,8 @@ pokprod002ctrl0 Ready control-plane,master 5d15h v1.29.11+148a389 pokprod002ctrl1 Ready control-plane,master 5d15h v1.29.11+148a389 pokprod002ctrl2 Ready control-plane,master 5d15h v1.29.11+148a389 ``` -Each worker node is equipped with eight H100 NVIDIA gpus. +Each worker node is equipped with eight [NVIDIA +H100](https://www.nvidia.com/en-us/data-center/h100/) GPUs. ```sh kubectl describe node pokprod-b93r38s3 ``` @@ -31,9 +47,9 @@ Name: pokprod-b93r38s3 Roles: worker Labels: beta.kubernetes.io/arch=amd64 ... - nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 + nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3 ... - nvidia.com/gpu.count=8 + nvidia.com/gpu.count=8 ... Capacity: cpu: 224 @@ -41,41 +57,43 @@ Capacity: hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 2113411308Ki - nvidia.com/gpu: 8 + nvidia.com/gpu: 8 openshift.io/p0_storage_sriov_nodepolicy: 8 pods: 250 rdma/roce_gdr: 0 ... ``` For this tutorial, we assume the [NVIDIA GPU -operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) +operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) is already -[installed](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) +[installed](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html) on the cluster. While this cluster is capable of [GPU-direct RDMA (GDR) with ROCE (RDMA over Converged -Ethernet)](https://medium.com/@sunyanan.choochotkaew1/unlocking-gpudirect-rdma-on-roce-in-kubernetes-based-cluster-on-cloud-through-multi-nic-cni-1e69ffb96296), -we will not cover advanced networking topics in this tutorial and disable this -feature. +Ethernet)](https://medium.com/@sunyanan.choochotkaew1/unlocking-gpudirect-rdma-on-roce-in-kubernetes-based-cluster-on-cloud-through-multi-nic-cni-1e69ffb96296), +we will not cover or rely on advanced networking configurations in this +tutorial. 
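The rest of the tutorial schedules GPU workloads by requesting the `nvidia.com/gpu` resource exposed by the device plugin, so a quick smoke test can confirm the resource is allocatable. The pod spec below is an illustrative sketch, not part of the MLBatch manifests; the pod name and CUDA image tag are assumptions.

```yaml
# Hypothetical smoke-test pod: requests one GPU through the resource name
# advertised by the NVIDIA device plugin and prints the device inventory.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed CUDA base image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

If the pod runs to completion and `nvidia-smi` reports an H100, the GPU operator and device plugin are healthy.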
-## MLBatch Setup +
-### Storage Setup +## Persistent Storage Setup -We assume storage is available by means of preconfigured -[NFS](https://en.wikipedia.org/wiki/Network_File_System) servers. We configure +We assume storage is available by means of a preexisting +[NFS](https://en.wikipedia.org/wiki/Network_File_System) server. We configure one storage class using the [NFS Subdir External Provisioner](https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner). + +
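Once the provisioner and storage class below are installed, dynamic provisioning can be sanity-checked end to end with a throwaway claim. This is an optional sketch; the claim name is illustrative.

```yaml
# Illustrative test claim against the nfs-client-pokprod storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-provisioning-test     # illustrative name
spec:
  storageClassName: nfs-client-pokprod
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
```

If the claim reaches the `Bound` state, the provisioner is creating subdirectories on the NFS export as expected; the claim can then be deleted.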
+ ```sh -helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner -helm repo update +helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner && helm repo update helm install -n nfs-provisioner pokprod nfs-subdir-external-provisioner/nfs-subdir-external-provisioner \ --create-namespace \ --set nfs.server=192.168.98.96 --set nfs.path=/gpfs/fs_ec/pokprod002 \ --set storageClass.name=nfs-client-pokprod --set storageClass.provisionerName=k8s-sigs.io/pokprod-nfs-subdir-external-provisioner ``` -Make sure to replace the server ip and path above with the right values for your -environment. +Make sure to set `nfs.server` and `nfs.path` to the correct values for your +environment. ```sh kubectl get storageclasses ``` @@ -84,134 +102,19 @@ NAME PROVISIONER R nfs-client-pokprod k8s-sigs.io/pokprod-nfs-subdir-external-provisioner Delete Immediate true 11s ``` OpenShift clusters require an additional configuration step to permit the -provisioner pod to mount the storage: +provisioner pod to mount the storage volume. ```sh oc adm policy add-scc-to-user hostmount-anyuid system:serviceaccount:nfs-provisioner:pokprod-nfs-subdir-external-provisioner ``` -### Prometheus Setup - -We follow the setup provided by the `prometheus-community/kube-prometheus-stack` Helm chart. - -```bash -helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update -``` - -The charts will install: Prometheus, Grafana, Alert Manager, Prometheus Node Exporter and Kube State Metrics. We set up the chart with the following: - -- Persistent storage for Prometheus, Grafana and Alert Manager; -- Override the Prometheus Node Exporter port; -- Disable CRDs creation as they are already present. - -You may leave the CRDs creation on, along with the default Node Exporter pod.
These changes are needed when deploying a separate Prometheus instance in OpenShift. - -```bash -cat << EOF >> config.yaml -crds: - enabled: false - -prometheus-node-exporter: - service: - port: 9110 - -alertmanager: - alertmanagerSpec: - persistentVolumeClaimRetentionPolicy: - whenDeleted: Retain - whenScaled: Retain - storage: - volumeClaimTemplate: - spec: - storageClassName: nfs-client-pokprod - accessModes: ["ReadWriteOnce"] - resources: - requests: - storage: 50Gi - -prometheus: - prometheusSpec: - persistentVolumeClaimRetentionPolicy: - whenDeleted: Retain - whenScaled: Retain - storageSpec: - volumeClaimTemplate: - spec: - storageClassName: nfs-client-pokprod - accessModes: ["ReadWriteOnce"] - resources: - requests: - storage: 50Gi - emptyDir: - medium: Memory - -grafana: - persistence: - enabled: true - type: sts - storageClassName: "nfs-client-pokprod" - accessModes: - - ReadWriteOnce - size: 20Gi - finalizers: - - kubernetes.io/pvc-protection -EOF - -helm upgrade -i kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml -``` - -If deploying on OpenShift based systems, you need to assign the privileged security context to the service accounts that are created by the helm chart. - -```bash -oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:kube-prometheus-stack-admission system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager system:serviceaccount:prometheus:kube-prometheus-stack-grafana system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics system:serviceaccount:prometheus:kube-prometheus-stack-operator system:serviceaccount:prometheus:kube-prometheus-stack-prometheus system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter -``` - -You should expect the following pods: +
-```bash -kubectl get pods -``` -```bash -NAME READY STATUS RESTARTS AGE -alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 16m -kube-prometheus-stack-grafana-0 3/3 Running 0 16m -kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69 1/1 Running 0 16m -kube-prometheus-stack-operator-7fbfc985bb-mm9bk 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-44llp 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-95gp8 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-dxf5f 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-f45dx 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-pfrzk 1/1 Running 0 16m -kube-prometheus-stack-prometheus-node-exporter-zpfzb 1/1 Running 0 16m -prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 16m -``` - -To access the Grafana dashboard on `localhost:3000`: - -```bash -kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo -``` -```bash -export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname) - kubectl --namespace prometheus port-forward $POD_NAME 3000 -``` - -To import NVidia and Autopilot metrics, from the Grafana Dashboard: - -- Select the `+` drop down menu on the top right, and **Import dashboard** -- In the `Grafana.com dashboard URL or ID` box, add [https://grafana.com/grafana/dashboards/23123-autopilot-metrics/](https://grafana.com/grafana/dashboards/23123-autopilot-metrics/) and click Load, then repeat with the NVidia dashboard [https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/) - -To visualize the metrics, we need to label the service monitor objects in both `autopilot` and `nvidia-gpu-operator` namespaces with the Prometheus release name. 
- -```bash -kubectl label servicemonitors.monitoring.coreos.com -n autopilot autopilot-metrics-monitor release=kube-prometheus-stack --overwrite -``` -```bash -kubectl label servicemonitors.monitoring.coreos.com -n nvidia-gpu-operator nvidia-dcgm-exporter gpu-operator nvidia-node-status-exporter release=kube-prometheus-stack --overwrite -``` +## MLBatch Cluster Setup -### MLBatch Cluster Setup +We deploy MLBatch to the cluster following +[CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md). -We follow instructions from [CLUSTER-SETUP.md](../setup.k8s/CLUSTER-SETUP.md). +
```sh # Clone MLBatch repository git clone ... @@ -222,7 +125,7 @@ cd mlbatch kubectl apply -f setup.k8s/mlbatch-priorities.yaml # Deploy scheduler plugins -helm install scheduler-plugins --namespace scheduler-plugins --create-namespace scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]' +helm install scheduler-plugins --namespace scheduler-plugins --create-namespace scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]' # Wait for scheduler-plugins pods to be running kubectl get pods -n scheduler-plugins @@ -250,12 +153,9 @@ kubectl get pods -n mlbatch-system kubectl apply --server-side -k setup.k8s/appwrapper/coscheduling # Deploy Autopilot -helm repo add autopilot https://ibm.github.io/autopilot/ -helm repo update - -helm upgrade autopilot autopilot/autopilot --install -n autopilot --create-namespace +helm repo add autopilot https://ibm.github.io/autopilot/ && helm repo update -kubectl label servicemonitors -n autopilot autopilot-metrics-monitor release=kube-prometheus-stack --overwrite +helm upgrade --install autopilot -n autopilot autopilot/autopilot --create-namespace # Create Kueue's default flavor kubectl apply -f setup.k8s/default-flavor.yaml @@ -263,7 +163,7 @@ kubectl apply -f setup.k8s/default-flavor.yaml # Setup mlbatch-edit-role kubectl apply -f setup.k8s/mlbatch-edit-role.yaml -# Create
slack cluster queue with 8 gpus +# Create slack cluster queue with 8 GPUs kubectl apply -f- << EOF apiVersion: kueue.x-k8s.io/v1beta1 kind: ClusterQueue @@ -278,7 +178,7 @@ spec: borrowWithinCohort: policy: Never resourceGroups: - - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] + - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] flavors: - name: default-flavor resources: @@ -286,7 +186,7 @@ spec: nominalQuota: 224 - name: "memory" nominalQuota: 2000G - - name: "nvidia.com/gpu" + - name: "nvidia.com/gpu" nominalQuota: 8 - name: "pods" nominalQuota: 100 @@ -294,31 +194,16 @@ EOF ``` We reserve 8 GPUs out of 24 for MLBatch's slack queue. -### Autopilot Extended Setup +
-It is possible to configure Autopilot so that it will test PVC creation and deletion given a storage class name. +## MLBatch Teams Setup -```bash -cat << EOF >> autopilot-extended.yaml -env: - - name: "PERIODIC_CHECKS" - value: "pciebw,remapped,dcgm,ping,gpupower,pvc" - - name: "PVC_TEST_STORAGE_CLASS" - value: "nfs-client-pokprod" -EOF -``` - -Then reapply the helm chart, this will start a rollout update. - -```bash -helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml -``` +We configure team `blue` with user `alice` and `red` with user `bob` following +[TEAM-SETUP.md](../setup.k8s/TEAM-SETUP.md). Each team has a nominal quota of 8 +GPUs. -### MLBatch Teams Setup +
-We configure team `blue` with user `alice` and `red` with user `bob` following -the [team setup](../setup.k8s/TEAM-SETUP.md). Each team has a nominal quota of -eight GPUs. ```sh # Create namespaces kubectl create ns blue kubectl create ns red @@ -342,7 +227,7 @@ spec: borrowWithinCohort: policy: Never resourceGroups: - - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] + - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] flavors: - name: default-flavor resources: @@ -350,7 +235,7 @@ spec: nominalQuota: 224 - name: "memory" nominalQuota: 2000G - - name: "nvidia.com/gpu" + - name: "nvidia.com/gpu" nominalQuota: 8 - name: "pods" nominalQuota: 100 @@ -379,7 +264,7 @@ spec: borrowWithinCohort: policy: Never resourceGroups: - - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] + - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "pods"] flavors: - name: default-flavor resources: @@ -387,7 +272,7 @@ spec: nominalQuota: 224 - name: "memory" nominalQuota: 2000G - - name: "nvidia.com/gpu" + - name: "nvidia.com/gpu" nominalQuota: 8 - name: "pods" nominalQuota: 100 @@ -439,12 +324,182 @@ portable. In this tutorial, we will rely on [user impersonation](https://kubernetes.io/docs/reference/access-authn-authz/authentication/#user-impersonation) with `kubectl` to run as a specific user. +
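Impersonated requests only succeed if the impersonated user is authorized in the target namespace. As a rough sketch of the authorization involved (the binding name is illustrative; the authoritative manifests are in [TEAM-SETUP.md](../setup.k8s/TEAM-SETUP.md)):

```yaml
# Sketch: grant alice the mlbatch-edit ClusterRole within team blue's namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-mlbatch-edit        # illustrative name
  namespace: blue
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mlbatch-edit
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: alice
```

With such a binding in place, a command like `kubectl --as alice -n blue get pods` exercises exactly the permissions `alice` holds.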
+ +## Extended Autopilot Setup + +Optionally, we configure Autopilot to test PVC creation and deletion with the +`nfs-client-pokprod` storage class. + +
+ +First, create the extended Autopilot configuration. +```sh +cat << EOF > autopilot-extended.yaml +env: + - name: "PERIODIC_CHECKS" + value: "pciebw,remapped,dcgm,ping,gpupower,pvc" + - name: "PVC_TEST_STORAGE_CLASS" + value: "nfs-client-pokprod" +EOF +``` +Then reapply the Helm chart, which will trigger a rolling update. +```sh +helm upgrade autopilot autopilot/autopilot --install --namespace=autopilot --create-namespace -f autopilot-extended.yaml +``` + +
+ +## Monitoring Setup + +Optionally, we deploy [Prometheus](https://prometheus.io) and [Grafana +dashboards](https://grafana.com/grafana/dashboards/) to the cluster. + +
+ +We follow the setup provided by the `prometheus-community/kube-prometheus-stack` +Helm chart. + +```sh +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update +``` + +The chart installs Prometheus, Grafana, Alertmanager, Prometheus Node +Exporter, and Kube State Metrics. We set up the chart with the following: + +- Persistent storage for Prometheus, Grafana, and Alertmanager; +- Override the Prometheus Node Exporter port; +- Disable CRD creation, as the CRDs are already present. + +You may leave CRD creation enabled, along with the default Node Exporter pod. +These changes are needed when deploying a separate Prometheus instance in +OpenShift. + +```sh +cat << EOF > config.yaml +crds: + enabled: false + +prometheus-node-exporter: + service: + port: 9110 + +alertmanager: + alertmanagerSpec: + persistentVolumeClaimRetentionPolicy: + whenDeleted: Retain + whenScaled: Retain + storage: + volumeClaimTemplate: + spec: + storageClassName: nfs-client-pokprod + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 50Gi + +prometheus: + prometheusSpec: + persistentVolumeClaimRetentionPolicy: + whenDeleted: Retain + whenScaled: Retain + storageSpec: + volumeClaimTemplate: + spec: + storageClassName: nfs-client-pokprod + accessModes: ["ReadWriteOnce"] + resources: + requests: + storage: 50Gi + emptyDir: + medium: Memory + +grafana: + persistence: + enabled: true + type: sts + storageClassName: "nfs-client-pokprod" + accessModes: + - ReadWriteOnce + size: 20Gi + finalizers: + - kubernetes.io/pvc-protection +EOF + +helm upgrade --install kube-prometheus-stack -n prometheus prometheus-community/kube-prometheus-stack --create-namespace -f config.yaml +``` + +If deploying on OpenShift-based systems, you need to assign the `privileged` +security context constraint to the service accounts created by the Helm chart.
+ +```sh +oc adm policy add-scc-to-user privileged system:serviceaccount:prometheus:kube-prometheus-stack-admission system:serviceaccount:prometheus:kube-prometheus-stack-alertmanager system:serviceaccount:prometheus:kube-prometheus-stack-grafana system:serviceaccount:prometheus:kube-prometheus-stack-kube-state-metrics system:serviceaccount:prometheus:kube-prometheus-stack-operator system:serviceaccount:prometheus:kube-prometheus-stack-prometheus system:serviceaccount:prometheus:kube-prometheus-stack-prometheus-node-exporter +``` + +You should expect the following pods: + +```sh +kubectl get pods -n prometheus +``` +```sh +NAME READY STATUS RESTARTS AGE +alertmanager-kube-prometheus-stack-alertmanager-0 2/2 Running 0 16m +kube-prometheus-stack-grafana-0 3/3 Running 0 16m +kube-prometheus-stack-kube-state-metrics-6f76b98d89-pxs69 1/1 Running 0 16m +kube-prometheus-stack-operator-7fbfc985bb-mm9bk 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-44llp 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-95gp8 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-dxf5f 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-f45dx 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-pfrzk 1/1 Running 0 16m +kube-prometheus-stack-prometheus-node-exporter-zpfzb 1/1 Running 0 16m +prometheus-kube-prometheus-stack-prometheus-0 2/2 Running 0 16m +``` + +To access the Grafana dashboard on `localhost:3000`: + +```sh +kubectl --namespace prometheus get secrets kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo +``` +```sh +export POD_NAME=$(kubectl --namespace prometheus get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prometheus-stack" -oname) +kubectl --namespace prometheus port-forward $POD_NAME 3000 +``` + +To import NVIDIA and Autopilot metrics from the Grafana dashboard: + +- Select the `+` drop-down menu on the top right, and **Import dashboard** +- In the
`Grafana.com dashboard URL or ID` box, add + [https://grafana.com/grafana/dashboards/23123-autopilot-metrics/](https://grafana.com/grafana/dashboards/23123-autopilot-metrics/) + and click Load, then repeat with the NVIDIA dashboard + [https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/) + +To visualize the metrics, we need to label the service monitor objects in both +`autopilot` and `nvidia-gpu-operator` namespaces with the Prometheus release +name. + +```sh +kubectl label servicemonitors.monitoring.coreos.com -n autopilot autopilot-metrics-monitor release=kube-prometheus-stack --overwrite +``` +```sh +kubectl label servicemonitors.monitoring.coreos.com -n nvidia-gpu-operator nvidia-dcgm-exporter gpu-operator nvidia-node-status-exporter release=kube-prometheus-stack --overwrite +``` + +
+ +## Workload Management + +TODO + +
+ +TODO + +
+ ## Example Workloads -Each example workload below is submitted as an -[AppWrapper](https://project-codeflare.github.io/appwrapper/). See -[USAGE.md](../USAGE.md) for a detailed discussion of queues and workloads in an -MLBatch cluster. +We now run a few example workloads. ### Batch Inference with vLLM @@ -453,6 +508,8 @@ In this example, `alice` runs a batch inference workload using [granite-3.2-8b-instruct](https://huggingface.co/ibm-granite/granite-3.2-8b-instruct) model. +
+ + First, `alice` creates a persistent volume claim to cache the model weights on first invocation so that subsequent instantiation of the model will reuse the cached data. @@ -515,11 +572,11 @@ spec: requests: cpu: 4 memory: 64Gi - nvidia.com/gpu: 1 + nvidia.com/gpu: 1 limits: cpu: 4 memory: 64Gi - nvidia.com/gpu: 1 + nvidia.com/gpu: 1 volumeMounts: - name: cache mountPath: /.cache @@ -559,6 +616,24 @@ The two containers are synchronized as follows: `load-generator` waits for `vllm` to be ready to accept requests and, upon completion of the batch, signals `vllm` to make it quit. -## Pre-Training with PyTorch +
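The handshake described above can be sketched with a shared volume and sentinel files; the pod below is a simplified stand-in (busybox placeholders instead of the actual vLLM and load-generator images), not the tutorial's manifest.

```yaml
# Simplified sketch of the two-container handshake: the server signals
# readiness with one file, and the load generator signals completion with
# another, which lets the server container exit.
apiVersion: v1
kind: Pod
metadata:
  name: sync-sketch               # illustrative name
spec:
  restartPolicy: Never
  volumes:
  - name: shared
    emptyDir: {}
  containers:
  - name: server                  # stands in for the vllm container
    image: busybox
    command: ["sh", "-c", "touch /shared/ready; until [ -f /shared/done ]; do sleep 5; done"]
    volumeMounts:
    - name: shared
      mountPath: /shared
  - name: load-generator          # stands in for the load generator
    image: busybox
    command: ["sh", "-c", "until [ -f /shared/ready ]; do sleep 5; done; touch /shared/done"]
    volumeMounts:
    - name: shared
      mountPath: /shared
```

In the actual workload, readiness is detected by polling the vLLM HTTP endpoint rather than a file; the AppWrapper above defines the exact mechanism. This sketch only illustrates the coordination pattern.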
+ +### Pre-Training with PyTorch + +TODO + +
TODO + +
+ +### Fine-Tuning with Ray + +TODO + +
+ +TODO + +