Description
I have a question about atomic GEMM and would appreciate an explanation. While reading the implementation of `CommOverlapP2PBase::atomic_gemm_overlap_rs`, I ran into something I don't understand. Take the case of two ranks:
- Rank 0 needs to first compute chunk1 and send it to Rank 1, where it is reduced with the chunk1 that Rank 1 computed itself.
- Rank 1 needs to first compute chunk0 and send it to Rank 0, where it is reduced with the chunk0 that Rank 0 computed itself.
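
To make the ordering I expect concrete, here is a minimal Python sketch of a ring reduce-scatter schedule. It is only my mental model, not the TransformerEngine code: the function name `ring_reduce_scatter`, the scalar stand-ins for GEMM output chunks, and the rule `chunk = (rank - step - 1) % n` are all assumptions made for illustration.

```python
def ring_reduce_scatter(partials):
    """Simulate a ring reduce-scatter over n ranks.

    partials[r][c] is rank r's locally computed value for chunk c
    (a plain number standing in for a GEMM output chunk).
    Returns (result, schedule), where result[r] is the fully reduced
    chunk r held by rank r, and schedule lists (step, src, dst, chunk).
    """
    n = len(partials)
    # acc[r][c] is rank r's running partial sum for chunk c.
    acc = [list(row) for row in partials]
    schedule = []
    for step in range(n - 1):
        # All ranks send "simultaneously": gather the messages first,
        # then apply them, so a send never sees this step's receive.
        messages = []
        for rank in range(n):
            chunk = (rank - step - 1) % n  # assumed ordering rule
            dst = (rank + 1) % n
            messages.append((rank, dst, chunk, acc[rank][chunk]))
        for src, dst, chunk, value in messages:
            acc[dst][chunk] += value
            schedule.append((step, src, dst, chunk))
    return [acc[r][r] for r in range(n)], schedule


# Two ranks: rank 0 holds (chunk0=1, chunk1=2), rank 1 holds (10, 20).
result, schedule = ring_reduce_scatter([[1, 2], [10, 20]])
for step, src, dst, chunk in schedule:
    print(f"step {step}: rank {src} sends chunk{chunk} to rank {dst}")
print(result)
```

In this model, with two ranks, Rank 0 sends chunk1 first and Rank 1 sends chunk0 first, so each rank ends up with its own fully reduced chunk. That is the ordering I expected to find in the implementation.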
However, in the current implementation (https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp#L981-L1000), both ranks appear to start their P2P communication from chunk0. Wouldn't this cause a problem, or is my understanding wrong?
Looking forward to your reply. Thank you!