
AllReduce with SHARP TL not being selected #1164

Description

@NiharKod

Hi,

I'm currently working on implementing UCC as a CCL backend in MPICH.

I'm benchmarking my code with the OSU Micro-Benchmarks Allreduce latency test with CUDA enabled. I've been trying to get the SHARP team layer selected via the environment variables, but have not been able to make it work: UCC always falls back to UCP.
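
The core SHARP-related settings I'm passing are these (pulled from the full run script further down; the comments are my understanding of what each one does):

# Restrict CL_BASIC to the SHARP and UCP team layers
export UCC_CL_BASIC_TLS=sharp,ucp
# Score TL_SHARP's allreduce at infinity so it should win selection
export UCC_TL_SHARP_TUNE="allreduce:inf"
# Allow SHARP teams as small as 2 ranks
export UCC_TL_SHARP_MIN_TEAM_SIZE=2
# Enable streaming aggregation (SAT), which CUDA allreduce relies on as far as I know
export SHARP_COLL_ENABLE_SAT=1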

Looking through my logs, SHARP itself seems to initialize properly, with no errors.

Here are a few log lines that suggest SHARP is being initialized successfully:

[gpu1:0:3114771 - context.c:1166][2025-07-24 13:28:01] DEBUG SAT tree_idx:1 rail_idx:0 endpoint created on device :mlx5_0 port:1
[gpu1:0:3114771 - utils/mpool.c:120][2025-07-24 13:28:01] DEBUG mpool sharp_buffer_mpool: align 128, maxelems 4294967295, elemsize 1616
[gpu1:0:3114771 - utils/mpool.c:120][2025-07-24 13:28:01] DEBUG mpool sharp_coll_reqs: align 128, maxelems 4294967295, elemsize 176
[gpu1:0:3114771 - utils/mpool.c:120][2025-07-24 13:28:01] DEBUG mpool sharp_coll_handles: align 128, maxelems 4294967295, elemsize 336
[gpu4:14:211743 - context.c:359][2025-07-24 13:28:01] DEBUG sharp_coll initialized. job_id: 23068493847765 init_time: 1355850.947

Right after the lines above, UCC printed the COLL_SCORE_MAP, and TL_SHARP is not an option in the Allreduce section:

[1753381681.974382] [gpu1:3114771:0] ucc_coll_score_map.c:225  UCC  INFO  Allreduce:
[1753381681.974382] [gpu1:3114771:0] ucc_coll_score_map.c:225  UCC  INFO  	Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 
[1753381681.974382] [gpu1:3114771:0] ucc_coll_score_map.c:225  UCC  INFO  	Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 
[1753381681.974382] [gpu1:3114771:0] ucc_coll_score_map.c:225  UCC  INFO  	CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10 

And the trace for the actual collective operation confirms that UCP is being used:

[1753381681.982586] [gpu1:3114771:0]        ucc_coll.c:301  UCC_COLL INFO  coll_init: Allreduce sum: src={0x14faac000000, 1, int32, Cuda}, dst={0x14faa8000000, 1, int32, Cuda}; CL_BASIC {TL_UCP}, team_id 1
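
For reference, my reading of the score-string syntax behind UCC_TL_SHARP_TUNE is <coll_type>:<mem_type>:<msg_range>:<score>, with most fields optional and multiple tokens joined by '#'. I would therefore expect my setting to promote TL_SHARP for all three memory-type rows of the map; the cuda-qualified token below is my assumption based on that format, not something I have verified:

# Equivalent to my current setting, but scoped explicitly to CUDA buffers
export UCC_TL_SHARP_TUNE="allreduce:cuda:inf"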

Here is my run script:

#!/bin/bash

MPICH_PATH=$HOME/mpich_hpcx/build/install
MPICH_KEN_PATH=/home/raffenet/software/hydra
CUDA_LIB=$CUDA_HOME/lib64
UCX_LIB=$HPCX_UCX_DIR/lib
UCC_LIB=$HPCX_UCC_DIR/lib

export LD_LIBRARY_PATH=$CUDA_LIB:$UCX_LIB:$UCC_LIB:$LD_LIBRARY_PATH
export PATH=$MPICH_KEN_PATH/bin:$PATH

echo $PBS_NODEFILE
cat $PBS_NODEFILE

${MPICH_KEN_PATH}/bin/mpiexec \
    -launcher pbs \
    -f "$PBS_NODEFILE" \
    -n 16 -ppn 8 \
    -genv LD_LIBRARY_PATH=$CUDA_LIB:$MPICH_PATH/lib:$LD_LIBRARY_PATH \
    -genv UCC_LOG_LEVEL=INFO \
    -genv UCC_COLL_TRACE=INFO \
    -genv UCC_TL_SHARP_TUNE="allreduce:inf" \
    -genv UCC_TL_SHARP_MIN_TEAM_SIZE=2 \
    -genv UCC_CL_BASIC_TLS=sharp,ucp \
    -genv UCC_TL_SHARP_REG_THRESH=0 \
    -genv SHARP_COLL_ENABLE_SAT=1 \
    -genv SHARP_COLL_SAT_THRESHOLD=4 \
    -genv SHARP_COLL_LOG_LEVEL=4 \
    -genv MPIR_CVAR_DEVICE_COLLECTIVES=none \
    -genv MPIR_CVAR_ALLREDUCE_INTRA_ALGORITHM=ccl \
    -genv MPIR_CVAR_ALLREDUCE_CCL=ucc \
    -genv MPIR_CVAR_CH4_OFI_ENABLE_HMEM=1 \
    ./install/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -d cuda -m 67108864
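
For completeness, this is a quick way to confirm that the SHARP TL plugin is even present in the UCC build being picked up (assuming the standard HPC-X layout, where UCC components are installed under lib/ucc):

# Expect libucc_tl_sharp.so among the installed UCC components
ls $HPCX_UCC_DIR/lib/ucc | grep -i sharp
# The plugin should resolve libsharp_coll from the HPC-X SHARP install
ldd $HPCX_UCC_DIR/lib/ucc/libucc_tl_sharp.so | grep -i sharp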

Additional setup information:

Running on 2 nodes with 8 GPUs per node (NVIDIA A100).

I'm using a custom build of MPICH with a UCC backend, together with the HPC-X builds of UCC, UCX, SHARP, etc.
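
The HPC-X component roots I'm pointing at can be shown like this (HPCX_UCC_DIR and HPCX_UCX_DIR come from the hpcx-init environment; HPCX_SHARP_DIR is my assumption for the SHARP component's variable name):

echo "UCC:   $HPCX_UCC_DIR"
echo "UCX:   $HPCX_UCX_DIR"
echo "SHARP: $HPCX_SHARP_DIR"  # assumed variable name; adjust to the local HPC-X env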

Please let me know if there is any additional information I can provide.

Thank you!
