
comm_gemm_overlap with atomic gemm + fp8 hangs #1918

@ys0100

Description


Hi team,

I am running experiments on overlapping communication and GEMM using atomic GEMM.

I am running on a DGX node with 8 H100 GPUs. I start the container with the following command, using the nvcr.io/nvidia/pytorch:25.06-py3 image:

docker run --gpus all --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it -v "$HOME/workspace:/workspace" nvcr.io/nvidia/pytorch:25.06-py3

Then I uninstalled the existing TE package and installed release_v2.4 in the container:

pip uninstall -y transformer-engine
git clone --branch release_v2.4 --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch
export CUDA_DEVICE_MAX_CONNECTIONS=1
MPI_HOME=/usr/local/mpi NVTE_WITH_USERBUFFERS=1 pip3 install --no-build-isolation -e .
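
To confirm that the editable install is the one actually being imported (a quick sanity check; the printed path should point into the cloned repository rather than the container's preinstalled site-packages):

# Sanity check: make sure Python picks up the editable TE install,
# not the wheel preinstalled in the container image.
import transformer_engine.pytorch as te

print(te.__file__)  # expect a path inside the cloned TransformerEngine tree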

I changed the default configuration to "atomic_gemm": True at the following line:

and also switched all overlap methods to ring_exchange, leaving the bulk overlaps as they are (see the sketch after this list):

"ring_exchange": ["qkv_fprop", "fc1_fprop", "proj_dgrad", "fc2_dgrad", "proj_fprop", "fc2_fprop"],
"pipeline": [],
"bulk": ["qkv_dgrad", "qkv_wgrad", "fc1_dgrad", "fc1_wgrad"],

Then, when I run te_layer_with_overlap.py from examples/pytorch/comm_gemm_overlap with the following command:
torchrun --nproc-per-node=8 te_layer_with_overlap.py --fp8 --debug

the forward pass hangs and never completes. I suspect a deadlock between the producer and consumer in the atomic GEMM.
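
To localize the hang, I can have each rank dump its Python stacks periodically (a minimal sketch using the standard faulthandler module; the 60-second interval is arbitrary). Added near the top of te_layer_with_overlap.py, this at least shows which call every rank is blocked in:

# Sketch: periodically dump all thread stacks so a hung rank reveals
# where it is blocked (e.g., inside the forward GEMM/comm overlap call).
import faulthandler
import sys

faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)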

I understand that atomic GEMM is only supported with FP8. Is some configuration missing on my side to run atomic GEMM with FP8, or is this feature not ready yet? Thank you.
