When WaitForWorkersReady is enabled in MPI operator, MPI operator and gang scheduler are in a deadlock #608
Does volcano offer an API to declare the size of the group beforehand? Otherwise, there is nothing we can do in this repo. You might also want to consider https://kueue.sig.k8s.io, which doesn't face this issue because it's not pod-based.
@yzhao-2023 That's right. Anyway, we should add documentation about this pitfall.
I don't want to add such a defaulting since users might be confused by the modified input value. I believe that validation would be better.
@alculquicondor We can tell volcano an arbitrary number via the PodGroup (runPolicy.minAvailable) here: mpi-operator/pkg/controller/podgroup.go, lines 130–131 at 4a63d3c
What I mean is whether we can tell volcano that X pods of a shape are coming, so that it reserves the space for them.
Ah, I see. Yes, that's right. We don't have any way to describe a pod shape to volcano/scheduler-plugins.
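For context, a volcano PodGroup expresses the gang only as a member count (optionally with aggregate minimum resources), never as per-pod shapes; a minimal sketch, with the name and numbers chosen here purely for illustration:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: demo-gang      # hypothetical name
spec:
  minMember: 4         # how many pods must be schedulable together;
                       # nothing here describes the shape of each pod
  minResources:        # optional aggregate total, still not per-pod shapes
    nvidia.com/gpu: 4
```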
If WaitForWorkersReady is enabled, the MPI operator and a gang scheduler get stuck in a deadlock: the operator does not create the launcher pod until all workers are running, while the gang scheduler will not schedule any pod in the group until every member, launcher included, has been created.
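For illustration, a minimal MPIJob that would hit this, assuming the v2beta1 launcherCreationPolicy field and a gang scheduler configured on the operator; the name, image, and replica counts are placeholders:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deadlock-demo                            # hypothetical name
spec:
  launcherCreationPolicy: WaitForWorkersReady    # operator holds the launcher back
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: example.com/mpi-app           # placeholder image
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker
            image: example.com/mpi-app           # placeholder image
```

With gang scheduling on, the generated PodGroup would require all five pods (one launcher plus four workers) before any are scheduled, yet the launcher is only created once the four workers are running, so neither side can make progress.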
A workaround, albeit one that still violates gang scheduling's semantics, is to set runPolicy.minAvailable to the worker count, allowing the MPI operator to create a pod group that requires only the worker pods, and allowing the gang scheduler to proceed with scheduling the workers. The problem is that the strict semantics of gang scheduling are broken, and the launcher might not be able to be scheduled.
In reality, this should not be a problem: the launcher job does not consume GPUs, so capacity for it should be amply available in our case. But the docs should be updated to reflect this pitfall.
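A sketch of that workaround applied to the manifest above, assuming minAvailable sits under the nested runPolicy.schedulingPolicy of the v2beta1 API; 4 matches the illustrative worker count:

```yaml
spec:
  runPolicy:
    schedulingPolicy:
      minAvailable: 4   # = worker count, so the gang no longer waits for the launcher
```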
A better fix might be to change the default behavior to create a pod group with only N members (N being the worker pod count), at the risk of the launcher never being started.
A possible true fix: extend Kubernetes to allocate resources for a pod without immediately starting it, so that the launcher can be started only after the workers are running.
[0] https://www.kubeflow.org/docs/components/training/mpi/#scheduling-policy
[1] https://www.alibabacloud.com/blog/the-burgeoning-kubernetes-scheduling-system-part-2-coscheduling-and-gang-scheduling-that-support-batch-jobs_597319