When WaitForWorkersReady is enabled in MPI operator, MPI operator and gang scheduler are in a deadlock #608
Does volcano offer an API to declare the size of the group beforehand? Otherwise, there is nothing we can do in this repo. You might also want to consider https://kueue.sig.k8s.io, which doesn't face this issue because it's not pod-based.
@yzhao-2023 That's right. Anyway, we should add documentation about this pitfall.
I don't want to add such a defaulting since users might be confused by the modified input value. I believe that validation would be better.
@alculquicondor We can tell volcano an arbitrary number via the PodGroup (runPolicy.minAvailable) here: mpi-operator/pkg/controller/podgroup.go, lines 130–131 at 4a63d3c
What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them.
Ah, I see. Yes, that's right. We don't have any way to describe a pod shape to volcano/scheduler-plugins.
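For context, a volcano PodGroup expresses the gang only as a member count (optionally with aggregate minimum resources), never as per-pod shapes; a minimal sketch, with the name and numbers chosen here purely for illustration:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: demo-gang      # hypothetical name
spec:
  minMember: 4         # how many pods must be schedulable together;
                       # nothing here describes the shape of each pod
  minResources:        # optional aggregate total, still not per-pod shapes
    nvidia.com/gpu: 4
```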
If WaitForWorkersReady is enabled, the MPI operator and a gang scheduler get stuck in a deadlock: the operator does not create the launcher pod until all workers are running, while the gang scheduler will not schedule any pod in the group until every member, launcher included, has been created.
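For illustration, a minimal MPIJob that would hit this, assuming the v2beta1 launcherCreationPolicy field and a gang scheduler configured on the operator; the name, image, and replica counts are placeholders:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deadlock-demo                            # hypothetical name
spec:
  launcherCreationPolicy: WaitForWorkersReady    # operator holds the launcher back
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/mpi-app           # placeholder image
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: example.com/mpi-app           # placeholder image
```

With gang scheduling on, the generated PodGroup would require all five pods (one launcher plus four workers) before any are scheduled, yet the launcher is only created once the four workers are running, so neither side can make progress.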
A workaround, albeit one that still violates gang scheduling's semantics, is to set runPolicy.minAvailable to the worker count, allowing the MPI operator to create a pod group that requires only the worker pods, and allowing the gang scheduler to proceed with scheduling the workers. The problem is that the strict semantics of gang scheduling are broken, and the launcher might not be able to be scheduled.
In reality, this should not be a problem: the launcher job does not consume GPUs, so capacity for it should be amply available in our case. But the docs should be updated to reflect this pitfall.
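A sketch of that workaround applied to the manifest above, assuming minAvailable sits under the nested runPolicy.schedulingPolicy of the v2beta1 API; 4 matches the illustrative worker count:

```yaml
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 4   # = worker count, so the gang no longer waits for the launcher
```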
A better fix might be to change the default behavior to create a pod group with only N members (N being the worker pod count), at the risk of the launcher never being started.
A possible true fix: extend Kubernetes to allocate resources for a pod without immediately starting it, so that the launcher can be started only after the workers are running.
[0] https://www.kubeflow.org/docs/components/training/mpi/#scheduling-policy
[1] https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319