TL/UCP Allgather performance in 1.4.x branch #1125

@wfaderhold21

Performance benchmarks on Thor show that the knomial algorithm for TL/UCP's allgather appears to be typically slower (~30%) on the 1.4.x branch than in the UCC 1.3 release when running the OSU microbenchmark suite with more than 1 PPN. The tables below give the measured latencies, followed by the command line to reproduce.

2 PPN

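| Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
|---|---|---|
| 1 | 6.56 | 8.74 |
| 2 | 6.56 | 8.70 |
| 4 | 6.66 | 8.98 |
| 8 | 6.86 | 9.13 |
| 16 | 7.51 | 9.50 |
| 32 | 7.64 | 10.32 |
| 64 | 8.41 | 11.04 |
| 128 | 9.23 | 12.62 |
| 256 | 11.46 | 15.20 |
| 512 | 14.10 | 21.78 |
| 1024 | 20.34 | 28.89 |
| 2048 | 30.84 | 39.30 |
| 4096 | 55.80 | 58.05 |
| 8192 | 91.32 | 86.52 |
| 16384 | 169.40 | 142.64 |
| 32768 | 297.00 | 251.90 |
| 65536 | 579.30 | 471.23 |
| 131072 | 1098.94 | 910.38 |
| 262144 | 2151.55 | 1834.34 |
| 524288 | 4189.40 | 5307.78 |
| 1048576 | 8655.66 | 11536.47 |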

32 PPN

| Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
|---|---|---|
| 1 | 19.28 | 23.80 |
| 2 | 15.64 | 45.93 |
| 4 | 58.74 | 99.40 |
| 8 | 65.78 | 75.84 |
| 16 | 69.69 | 81.49 |
| 32 | 63.51 | 51.74 |
| 64 | 81.11 | 81.08 |
| 128 | 132.74 | 170.85 |
| 256 | 178.80 | 227.84 |
| 512 | 315.59 | 406.94 |
| 1024 | 584.30 | 731.71 |
| 2048 | 1166.37 | 1439.37 |
| 4096 | 2654.10 | 2941.00 |
| 8192 | 5090.83 | 5545.75 |
| 16384 | 10090.45 | 11664.98 |
| 32768 | 20186.39 | 23996.53 |
| 65536 | 40567.93 | 49084.44 |
| 131072 | 81698.88 | 98689.84 |
| 262144 | 164481.83 | 194390.74 |
| 524288 | 330808.08 | 387310.95 |
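
To make the size of the regression easier to read off, here is a minimal sketch that computes the relative latency change for a few rows copied from the two tables above. The helper names (`relative_change`, `summarize`) and the choice of sample rows are illustrative assumptions, not part of UCC or OSU; the latencies are taken to be in microseconds, as OSU reports them.

```python
# Sample rows copied from the tables above: size (bytes) -> (UCC 1.3, UCC 1.4).
# Latencies are assumed to be in microseconds, as reported by osu_allgather.
TWO_PPN = {
    1: (6.56, 8.74),
    1024: (20.34, 28.89),
    16384: (169.40, 142.64),
    1048576: (8655.66, 11536.47),
}

THIRTY_TWO_PPN = {
    1: (19.28, 23.80),
    1024: (584.30, 731.71),
    524288: (330808.08, 387310.95),
}


def relative_change(old, new):
    """Percent change of `new` relative to `old`; positive means slower."""
    return (new - old) / old * 100.0


def summarize(table):
    """Map each message size to its percent latency change, rounded to 0.1."""
    return {
        size: round(relative_change(old, new), 1)
        for size, (old, new) in table.items()
    }


if __name__ == "__main__":
    for name, table in (("2 PPN", TWO_PPN), ("32 PPN", THIRTY_TWO_PPN)):
        print(name)
        for size, pct in summarize(table).items():
            print(f"  {size:>8} bytes: {pct:+.1f}%")
```

A negative value means the 1.4.x branch was faster for that size, which is the case for the 2 PPN run at 16384 bytes (142.64 vs 169.40).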

To reproduce:

Software stack: UCX 1.15.x, UCC (1.3.x or 1.4.x branch), OMPI 5.0.x, OSU microbenchmark suite 5.7.1

cmdline:
mpirun -np 1024 --map-by node --bind-to core --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=fatal -x UCC_LOG_LEVEL=fatal -x UCC_CLS=basic -x UCC_TL_UCP_TUNE=allgather:0-inf:@0 ./mpi/collective/osu_allgather
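
When reproducing, it may help to line up the output of the two runs side by side. The sketch below parses saved `osu_allgather` output and pairs the rows by message size. It assumes the usual OSU layout, in which header lines start with `#` and each data line holds the size followed by the average latency; the function names (`parse_osu_output`, `pair_runs`) are hypothetical helpers, not part of any of the tools above.

```python
def parse_osu_output(text):
    """Return {size_bytes: avg_latency} from osu_allgather text output.

    Assumes header lines start with '#' and data lines look like
    '<size> <avg latency> ...'; anything else is skipped.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            size = int(fields[0])
            latency = float(fields[1])
        except ValueError:
            continue  # not a data row (e.g. stray log output)
        results[size] = latency
    return results


def pair_runs(baseline_text, candidate_text):
    """Return [(size, baseline_latency, candidate_latency)] for shared sizes."""
    base = parse_osu_output(baseline_text)
    cand = parse_osu_output(candidate_text)
    return [(size, base[size], cand[size]) for size in sorted(base.keys() & cand.keys())]


if __name__ == "__main__":
    # Tiny inline samples using the first rows of the 2 PPN table.
    ucc_13 = "# Size       Avg Latency(us)\n1 6.56\n2 6.56\n"
    ucc_14 = "# Size       Avg Latency(us)\n1 8.74\n2 8.70\n"
    for size, old, new in pair_runs(ucc_13, ucc_14):
        print(f"{size:>8} {old:>10.2f} {new:>10.2f}")
```

Sizes that appear in only one of the two outputs are dropped, so a run that was cut short still compares cleanly against a full one.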
