-
Hi! I have 640 GPUs on 80 machines to do some pretraining work. I have read the ZeRO tutorials but didn't see any config option for the number of partitions. My question is: if I start training on all 80 machines with ZeRO stage 3 (without any model parallelism), will the model parameters be split into 640 parts (DP 640?), one loaded on each GPU? Waiting for a reply, thank you!
-
@cccc0der, yes, vanilla ZeRO-3 will split each parameter across all 640 GPUs. However, we are integrating ZeRO-3 improvements that shard across a subset of the DP GPUs. One such algorithm, MiCS, is already available in the 0.9.2 release. A second algorithm will be out soon, so be on the lookout. @samadejacobs, FYI.
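The partitioning arithmetic can be sketched as follows. The helper below is illustrative, not the DeepSpeed API; it just shows how the per-GPU shard size shrinks with the partition count, and how MiCS-style sharding within a smaller group (e.g. one 8-GPU node) trades memory savings for cheaper all-gathers:

```python
def shard_numel(param_numel: int, partition_count: int) -> int:
    """Per-rank shard size after ZeRO-3-style partitioning.

    The parameter is padded so it divides evenly across the partition
    group (ceiling division), then each rank holds one equal slice.
    """
    padded = -(-param_numel // partition_count) * partition_count
    return padded // partition_count

world_size = 640            # 80 machines x 8 GPUs, pure data parallelism
param_numel = 4096 * 4096   # one example weight matrix (~16.8M elements)

# Vanilla ZeRO-3: every parameter is split across all 640 DP ranks.
print(shard_numel(param_numel, world_size))  # 26215 elements per GPU

# MiCS-style: shard only within an 8-GPU group; each GPU holds a
# larger slice, but collectives stay within the fast intra-node links.
print(shard_numel(param_numel, 8))           # 2097152 elements per GPU
```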
-
@cccc0der, please open an issue for this. |
-
@cccc0der, calling MiCS Init in your code should fix the reported error:
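A minimal sketch of what that looks like, assuming a DeepSpeed 0.9.2+ setup where `deepspeed.zero.MiCS_Init` mirrors `deepspeed.zero.Init` for vanilla ZeRO-3 (the model class and config values here are placeholders):

```python
import deepspeed

ds_config = {
    "train_batch_size": 640,
    "zero_optimization": {
        "stage": 3,
        "mics_shard_size": 8,  # shard parameters within each 8-GPU group
    },
}

# Construct the model inside the MiCS_Init context so parameters are
# partitioned within the MiCS shard group at creation time.
with deepspeed.zero.MiCS_Init(config_dict_or_path=ds_config):
    model = MyModel()  # hypothetical model class

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```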