-
Hi! I have 640 GPUs on 80 machines to do some pretraining work. I have read the ZeRO tutorials but didn't see any config option for the number of partitions. My question is: if I start training on all 80 machines with ZeRO stage 3 (without any model parallelism), will the model parameters be split into 640 parts (DP 640?), one loaded on each GPU? Waiting for a reply, thank you!
-
@cccc0der, yes, vanilla ZeRO-3 will split each parameter across all 640 GPUs. However, we are integrating ZeRO-3 improvements that shard across a subset of the DP GPUs. One such algorithm, MiCS, is already available in the 0.9.2 release. A second algorithm will be out soon, so be on the lookout. @samadejacobs, FYI.
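The partitioning arithmetic can be sketched as follows. The helper below is illustrative, not the DeepSpeed API; it just shows how the per-GPU shard size shrinks with the partition count, and how MiCS-style sharding within a smaller group (e.g. one 8-GPU node) trades memory savings for cheaper all-gathers:

```python
def shard_numel(param_numel: int, partition_count: int) -> int:
    """Per-rank shard size after ZeRO-3-style partitioning.

    The parameter is padded so it divides evenly across the partition
    group (ceiling division), then each rank holds one equal slice.
    """
    padded = -(-param_numel // partition_count) * partition_count
    return padded // partition_count

world_size = 640            # 80 machines x 8 GPUs, pure data parallelism
param_numel = 4096 * 4096   # one example weight matrix (~16.8M elements)

# Vanilla ZeRO-3: every parameter is split across all 640 DP ranks.
print(shard_numel(param_numel, world_size))  # 26215 elements per GPU

# MiCS-style: shard only within an 8-GPU group; each GPU holds a
# larger slice, but collectives stay within the fast intra-node links.
print(shard_numel(param_numel, 8))           # 2097152 elements per GPU
```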
-
@cccc0der, please open an issue for this. |
-
@cccc0der, calling MiCS Init in your code should fix the reported error:
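A minimal sketch of what that looks like, assuming a DeepSpeed 0.9.2+ setup where `deepspeed.zero.MiCS_Init` mirrors `deepspeed.zero.Init` for vanilla ZeRO-3 (the model class and config values here are placeholders):

```python
import deepspeed

ds_config = {
    "train_batch_size": 640,
    "zero_optimization": {
        "stage": 3,
        "mics_shard_size": 8,  # shard parameters within each 8-GPU group
    },
}

# Construct the model inside the MiCS_Init context so parameters are
# partitioned within the MiCS shard group at creation time.
with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
    model = MyModel()  # hypothetical model class

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```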