Description
Hi team,
I am running experiments on overlapping communication and GEMM using atomic GEMM.
I am using a DGX system with 8 H100 GPUs.
I start the container from the nvcr.io/nvidia/pytorch:25.06-py3 image with the following command:
docker run --gpus all --rm --network=host --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it -v "$HOME/workspace:/workspace" nvcr.io/nvidia/pytorch:25.06-py3
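Inside the container, all 8 GPUs are visible (a quick sanity check, nothing TE-specific):
python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"
This should report 8 and the H100 device name.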
Then I uninstalled the existing TE package and installed release_v2.4 from source inside the container:
pip uninstall -y transformer-engine
git clone --branch release_v2.4 --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
export NVTE_FRAMEWORK=pytorch
export CUDA_DEVICE_MAX_CONNECTIONS=1
MPI_HOME=/usr/local/mpi NVTE_WITH_USERBUFFERS=1 pip3 install --no-build-isolation -e .
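After the build, I double-check that the editable install is the one Python picks up:
python -c "import transformer_engine; print(transformer_engine.__file__)"
which should point at the cloned TransformerEngine directory.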
I changed "atomic_gemm" to True in the default configuration at the following line:
default_cfg = {
and I also switched all layers to ring_exchange by changing the methods dict while leaving the bulk entries as they were:
"ring_exchange": ["qkv_fprop", "fc1_fprop", "proj_dgrad", "fc2_dgrad", "proj_fprop", "fc2_fprop"],
"pipeline": [],
"bulk": ["qkv_dgrad", "qkv_wgrad", "fc1_dgrad", "fc1_wgrad"],
methods = { |
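(For what it's worth, if initialize_ub's ub_cfgs argument can override these per-layer settings, I believe the same configuration could be expressed without editing base.py. This is just my assumption, not something I have verified against release_v2.4, and the shape/tp_size values below are placeholders for the sketch:

import torch
import transformer_engine.pytorch as te

# hypothetical per-layer override: ring_exchange + atomic GEMM for the non-bulk overlaps
ub_cfgs = {
    name: {"method": "ring_exchange", "atomic_gemm": True}
    for name in ["qkv_fprop", "fc1_fprop", "proj_dgrad",
                 "fc2_dgrad", "proj_fprop", "fc2_fprop"]
}
# placeholder shape/tp_size just to illustrate the call
te.initialize_ub(shape=[2048, 12288], tp_size=8, use_fp8=True,
                 dtype=torch.bfloat16, ub_cfgs=ub_cfgs)

If that is the intended way to enable atomic GEMM, please let me know.)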
Then, when I run te_layer_with_overlap.py from examples/pytorch/comm_gemm_overlap with the following command:
torchrun --nproc-per-node=8 te_layer_with_overlap.py --fp8 --debug
the forward pass hangs and never completes. I suspect there is a deadlock between the producer and consumer in the atomic GEMM.
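If per-rank stack traces would help, I can add a periodic dump at the top of te_layer_with_overlap.py, e.g. with plain faulthandler (nothing TE-specific):

import faulthandler
# dump all Python thread stacks every 2 minutes so the hanging call is visible on each rank
faulthandler.dump_traceback_later(timeout=120, repeat=True, exit=False)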
I suspect atomic GEMM is only supported with FP8. Is there some configuration missing on my side to run atomic GEMM with FP8, or is this feature not ready yet? Thank you.