
comm_gemm_overlap with atomic gemm + fp8 hangs #1918

@ys0100

Description


Hi team,

I am running experiments on overlapping communication and GEMM using atomic GEMM.

I am running on a DGX node with 8 H100 GPUs. I start the container with the following command, using the nvcr.io/nvidia/pytorch:25.06-py3 image:

docker run --gpus all --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it -v "$HOME/workspace:/workspace" nvcr.io/nvidia/pytorch:25.06-py3

Then I uninstalled the existing TE package and installed release_v2.4 in the container:

pip uninstall -y transformer-engine
git clone --branch release_v2.4 --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch
export CUDA_DEVICE_MAX_CONNECTIONS=1
MPI_HOME=/usr/local/mpi NVTE_WITH_USERBUFFERS=1 pip3 install --no-build-isolation -e .
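
To confirm that the editable install is the one actually being imported (a quick sanity check; the printed path should point into the cloned repository rather than the container's preinstalled site-packages):

# Sanity check: make sure Python picks up the editable TE install,
# not the wheel preinstalled in the container image.
import transformer_engine.pytorch as te

print(te.__file__)  # expect a path inside the cloned TransformerEngine tree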

I changed the default configuration to "atomic_gemm": True at the following line:

and also switched all overlap methods to ring_exchange, leaving the bulk overlaps as they are (see the sketch after this list):

"ring_exchange": ["qkv_fprop", "fc1_fprop", "proj_dgrad", "fc2_dgrad", "proj_fprop", "fc2_fprop"],
"pipeline": [],
"bulk": ["qkv_dgrad", "qkv_wgrad", "fc1_dgrad", "fc1_wgrad"],

Then, when I run te_layer_with_overlap.py from examples/pytorch/comm_gemm_overlap with the following command:
torchrun --nproc-per-node=8 te_layer_with_overlap.py --fp8 --debug

the forward pass hangs and never completes. I suspect a deadlock between the producer and consumer in the atomic GEMM.
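
To localize the hang, I can have each rank dump its Python stacks periodically (a minimal sketch using the standard faulthandler module; the 60-second interval is arbitrary). Added near the top of te_layer_with_overlap.py, this at least shows which call every rank is blocked in:

# Sketch: periodically dump all thread stacks so a hung rank reveals
# where it is blocked (e.g., inside the forward GEMM/comm overlap call).
import faulthandler
import sys

faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)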

I understand that atomic GEMM is only supported with FP8. Is some configuration missing on my side to run atomic GEMM with FP8, or is this feature not ready yet? Thank you.
