fill in top-level TODO items #156

Merged 4 commits on Mar 20, 2025
18 changes: 10 additions & 8 deletions setup.KubeConEU25/README.md
@@ -121,10 +121,10 @@ cd mlbatch
# Setup priority classes
kubectl apply -f setup.k8s/mlbatch-priorities.yaml

-# Deploy scheduler plugins
+# Deploy scheduler-plugins
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"},{"args":{"permitWaitingTimeSeconds":300},"name":"Coscheduling"}]'

-# Wait for scheduler-plugins pods to be running
+# Wait for scheduler-plugins pods to be ready
while [[ $(kubectl get pods -n scheduler-plugins -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}' | tr ' ' '\n' | sort -u) != "True" ]]
do
echo -n "." && sleep 1;
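
Reviewer note: the one-line `--set-json pluginConfig=...` argument in this hunk is hard to scan. The sketch below expands the same JSON for reading. As far as I can tell, it configures `NodeResourcesFit` with a `RequestedToCapacityRatio` score that rises from 0 at 0% utilization to 10 at 100%, which favors packing GPU requests onto already-used nodes, and sets a 300-second permit wait for `Coscheduling`. The variable names `plugin_config` and `plugin_config_compact` are hypothetical and not part of the README, and the lowercase resource name `nvidia.com/gpu` is assumed to be the one advertised by the NVIDIA device plugin.

```shell
# Same plugin configuration as the one-line --set-json argument, expanded for reading.
# Assumption: the GPU resource name is the lowercase nvidia.com/gpu.
plugin_config=$(cat <<'EOF'
[
  {
    "name": "NodeResourcesFit",
    "args": {
      "scoringStrategy": {
        "type": "RequestedToCapacityRatio",
        "resources": [{"name": "nvidia.com/gpu", "weight": 1}],
        "requestedToCapacityRatio": {
          "shape": [
            {"utilization": 0, "score": 0},
            {"utilization": 100, "score": 10}
          ]
        }
      }
    }
  },
  {
    "name": "Coscheduling",
    "args": {"permitWaitingTimeSeconds": 300}
  }
]
EOF
)

# Strip spaces and newlines so the value has the same compact shape as in the README.
# This is safe here only because no string in the JSON contains whitespace.
plugin_config_compact=$(printf '%s' "$plugin_config" | tr -d ' \n')

# Hypothetical usage (not executed here): pass the variable instead of the literal.
# helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
#   scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
#   --set-json pluginConfig="$plugin_config_compact"
```

Keeping the expanded form in a variable would make future changes to the scoring shape easier to review, at the cost of a few extra lines in the setup script.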
@@ -154,8 +154,6 @@ do
done
echo ""

-kubectl get pods -n mlbatch-system
-
# Deploy AppWrapper
kubectl apply --server-side -k setup.k8s/appwrapper/coscheduling
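
Reviewer note: the readiness loops in these hunks poll until every pod reports `Ready`, but they never give up if a pod stays unready. A minimal sketch of a bounded variant follows, assuming bash. The helper name `wait_until_all_true` and its timeout argument are hypothetical and do not appear in the README.

```shell
#!/usr/bin/env bash
# wait_until_all_true TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until its space-separated output collapses
# to the single value "True", or returns 1 once TIMEOUT_SECONDS have elapsed.
wait_until_all_true() {
  local timeout="$1"
  shift
  local elapsed=0
  while [[ "$("$@" | tr ' ' '\n' | sort -u)" != "True" ]]; do
    if (( elapsed >= timeout )); then
      echo "timed out after ${timeout}s" >&2
      return 1
    fi
    echo -n "." >&2
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "" >&2
}

# Hypothetical usage (requires a cluster, so it is left commented out):
# wait_until_all_true 300 kubectl get pods -n scheduler-plugins \
#   -o 'jsonpath={..status.conditions[?(@.type=="Ready")].status}'
```

The check is the same `tr | sort -u` comparison used in the README, so empty output (no pods yet) still counts as "not ready" and keeps the loop waiting until the timeout.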

@@ -496,7 +494,8 @@ kubectl label servicemonitors.monitoring.coreos.com -n nvidia-gpu-operator nvidi

## Workload Management

-TODO
+We will now demonstrate the queueing, quota management, and fault recovery capabilities
+of MLBatch using synthetic workloads.

<details>

@@ -506,7 +505,8 @@ TODO

## Example Workloads

-We now run a few example workloads.
+We will now run some sample workloads that are representative of what is run on
+an AI GPU cluster.

### Batch Inference with vLLM

@@ -627,7 +627,8 @@ The two containers are synchronized as follows: `load-generator` waits for

### Pre-Training with PyTorch

-TODO
+In this example, `alice` uses the [Kubeflow Training Operator](https://github.com/kubeflow/training-operator)
+to run a job that uses [PyTorch](https://pytorch.org) to train a machine learning model.

<details>

@@ -637,7 +638,8 @@ TODO

### Fine-Tuning with Ray

-TODO
+In this example, `alice` uses [KubeRay](https://github.com/ray-project/kuberay) to run a job that
+uses [Ray](https://github.com/ray-project/ray) to fine-tune a machine learning model.

<details>
