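Running some performance benchmarks on Thor showed that the knomial algorithm for TL/UCP's allgather is typically slower (by roughly 30%) in UCC 1.4 than in the UCC 1.3 release when running the OSU microbenchmark suite with more than 1 process per node (PPN). UCC 1.4 is faster in a few ranges, most notably 8192 to 262144 bytes at 2 PPN. The tables below list the measured latencies, followed by the software stack and command line used to reproduce them.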
2 PPN
Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
---|---|---|
1 | 6.56 | 8.74 |
2 | 6.56 | 8.70 |
4 | 6.66 | 8.98 |
8 | 6.86 | 9.13 |
16 | 7.51 | 9.50 |
32 | 7.64 | 10.32 |
64 | 8.41 | 11.04 |
128 | 9.23 | 12.62 |
256 | 11.46 | 15.20 |
512 | 14.10 | 21.78 |
1024 | 20.34 | 28.89 |
2048 | 30.84 | 39.30 |
4096 | 55.80 | 58.05 |
8192 | 91.32 | 86.52 |
16384 | 169.40 | 142.64 |
32768 | 297.00 | 251.90 |
65536 | 579.30 | 471.23 |
131072 | 1098.94 | 910.38 |
262144 | 2151.55 | 1834.34 |
524288 | 4189.40 | 5307.78 |
1048576 | 8655.66 | 11536.47 |
32 PPN
Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
---|---|---|
1 | 19.28 | 23.80 |
2 | 15.64 | 45.93 |
4 | 58.74 | 99.40 |
8 | 65.78 | 75.84 |
16 | 69.69 | 81.49 |
32 | 63.51 | 51.74 |
64 | 81.11 | 81.08 |
128 | 132.74 | 170.85 |
256 | 178.80 | 227.84 |
512 | 315.59 | 406.94 |
1024 | 584.30 | 731.71 |
2048 | 1166.37 | 1439.37 |
4096 | 2654.10 | 2941.00 |
8192 | 5090.83 | 5545.75 |
16384 | 10090.45 | 11664.98 |
32768 | 20186.39 | 23996.53 |
65536 | 40567.93 | 49084.44 |
131072 | 81698.88 | 98689.84 |
262144 | 164481.83 | 194390.74 |
524288 | 330808.08 | 387310.95 |
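To make the relative difference easier to see, here is a minimal sketch that computes the percent change from UCC 1.3 to UCC 1.4 for a handful of rows copied from the tables above. The names `SAMPLES`, `pct_change`, and `report` are made up for this illustration and are not part of UCC or OSU; the values are assumed to be average latencies in microseconds, as `osu_allgather` normally reports them.

```python
# Hypothetical helper, not part of UCC or the OSU suite.
# Values are (UCC 1.3, UCC 1.4) latencies copied from the tables above,
# assumed to be in microseconds. Keys are message sizes in bytes.
SAMPLES = {
    "2 PPN": {
        1: (6.56, 8.74),
        1024: (20.34, 28.89),
        65536: (579.30, 471.23),
        1048576: (8655.66, 11536.47),
    },
    "32 PPN": {
        128: (132.74, 170.85),
        1024: (584.30, 731.71),
        65536: (40567.93, 49084.44),
        524288: (330808.08, 387310.95),
    },
}


def pct_change(old, new):
    """Percent change from old to new; positive means new is slower."""
    return (new - old) / old * 100.0


def report(samples):
    """Return one formatted line per (PPN, size) pair, sorted by size."""
    lines = []
    for ppn, rows in samples.items():
        for size, (v13, v14) in sorted(rows.items()):
            delta = pct_change(v13, v14)
            lines.append(f"{ppn:>6} {size:>8} B  {delta:+6.1f}%")
    return lines


if __name__ == "__main__":
    print("\n".join(report(SAMPLES)))
```

A positive value means UCC 1.4 is slower for that size. For the 1-byte row at 2 PPN this works out to about +33%, while the 65536-byte row at 2 PPN comes out negative, i.e. 1.4 is faster there.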
To reproduce:
Software stack: UCX 1.15.x, UCC (1.3.x or 1.4.x branch), OMPI 5.0.x, OSU microbenchmark suite 5.7.1
Command line:
mpirun -np 1024 --map-by node --bind-to core --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=fatal -x UCC_LOG_LEVEL=fatal -x UCC_CLS=basic -x UCC_TL_UCP_TUNE=allgather:0-inf:@0 ./mpi/collective/osu_allgather