How can I deploy distributed training on Kubernetes clusters with torch.distributed.launch? #560

ThomaswellY · 2023-06-05T07:32:13Z

I have been using the mmpretrain project (https://github.com/open-mmlab/mmpretrain), which provides a large number of classification scripts. However, these scripts use torch.distributed.launch to start distributed training. Is there a way to start this kind of distributed training on a k8s cluster with the Kubeflow operators?
PS: I have looked for help in training-operator and pytorch-operator, but could not find an obvious solution.
Thanks in advance. Any hints would be helpful.

alculquicondor · 2023-06-05T12:27:10Z

Are you in the wrong repo?
This repo is about MPI. PyTorch is supported in https://github.com/kubeflow/training-operator

tenzen-y · 2023-06-05T14:56:59Z

@ThomaswellY If you want to run torchrun, you should open an issue in the training-operator repo. If you want to run distributed PyTorch training with mpirun, we can answer your questions in this repo.

Which commands do you mean?

ThomaswellY · 2023-06-06T02:57:54Z

@alculquicondor @tenzen-y
I am looking for a way to modify the original script, which uses torch.distributed.launch to start training, so that it can be started with mpirun in mpi-operator.
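
For readers with the same question, here is a minimal sketch of one possible approach. It is not confirmed anywhere in this thread. It assumes the processes are started by Open MPI's mpirun, which exports `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE`, and `OMPI_COMM_WORLD_LOCAL_RANK`, and that the training script initializes PyTorch with `init_method="env://"`, which reads `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`. The helper name `mpi_env_to_torch_env` and the default address and port are hypothetical. On a multi-node job, `MASTER_ADDR` would have to point at a host that every worker can reach.

```python
import os


def mpi_env_to_torch_env(env, master_addr="127.0.0.1", master_port="29500"):
    """Translate Open MPI rank variables into the names torch.distributed expects.

    `env` is any mapping of environment variables (for example os.environ).
    The defaults for master_addr and master_port are placeholders (assumptions);
    they are only suitable for a single-node run.
    """
    return {
        # Global rank of this process across all nodes.
        "RANK": env.get("OMPI_COMM_WORLD_RANK", "0"),
        # Total number of processes started by mpirun.
        "WORLD_SIZE": env.get("OMPI_COMM_WORLD_SIZE", "1"),
        # Rank of this process on its own node, usually used to pick a GPU.
        "LOCAL_RANK": env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"),
        # Rendezvous endpoint: keep values that are already set, else fall back.
        "MASTER_ADDR": env.get("MASTER_ADDR", master_addr),
        "MASTER_PORT": env.get("MASTER_PORT", master_port),
    }


# Possible usage at the top of a training entry point, before PyTorch is initialized:
#
#   os.environ.update(mpi_env_to_torch_env(os.environ))
#   torch.distributed.init_process_group(backend="nccl", init_method="env://")
```

With this kind of shim, the script no longer depends on torch.distributed.launch to populate the rank variables, so mpirun can start one process per GPU directly.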
