something about atomic gemm

I have a question about atomic GEMM. While reading the implementation of CommOverlapP2PBase::atomic_gemm_overlap_rs, I ran into something I don't understand. Take the case of two ranks:

- Rank 0 needs to first compute chunk1 and send it to Rank 1, where it will be reduced with the chunk1 computed by Rank 1 itself.

- Rank 1 needs to first compute chunk0 and send it to Rank 0, where it will be reduced with the chunk0 computed by Rank 0 itself.
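
To make the ordering I expect concrete, here is a minimal Python sketch of the schedule described in the two points above. It is only my mental model of a P2P reduce-scatter, not the TransformerEngine implementation. The names `expected_send_schedule` and `simulate_reduce_scatter` are hypothetical, and I am assuming that chunk `c` ends up on rank `c` and that each partial result is a plain number.

```python
def expected_send_schedule(num_ranks):
    """Return, per rank, the (step, chunk_id, dest_rank) sends I would expect.

    Assumption (not taken from TransformerEngine): rank r owns chunk r, so every
    other chunk is sent to the rank with the same index as the chunk.
    """
    schedule = {}
    for rank in range(num_ranks):
        sends = []
        for step in range(1, num_ranks):
            dest = (rank + step) % num_ranks
            sends.append((step, dest, dest))  # chunk id equals destination rank
        schedule[rank] = sends
    return schedule


def simulate_reduce_scatter(partials):
    """partials[r][c] is rank r's partial result for chunk c (a number here)."""
    num_ranks = len(partials)
    # Each rank starts from its own partial of the chunk it owns.
    reduced = [partials[rank][rank] for rank in range(num_ranks)]
    for rank, sends in expected_send_schedule(num_ranks).items():
        for _step, chunk, dest in sends:
            # The destination rank reduces the received partial into its chunk.
            reduced[dest] += partials[rank][chunk]
    return reduced


# Two ranks: rank 0 holds partials (1, 2), rank 1 holds partials (10, 20).
print(expected_send_schedule(2))
print(simulate_reduce_scatter([[1, 2], [10, 20]]))
```

With two ranks, this model has rank 0 sending chunk1 first and rank 1 sending chunk0 first, which is why both ranks starting from chunk0 surprised me.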

However, in the current implementation (https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp#L981-L1000), both ranks start their P2P communication from chunk0. Wouldn't this cause a problem? Or am I misunderstanding something?

Looking forward to your reply. Thank you!

something about atomic gemm #1760
