TL/UCP Allgather performance in 1.4.x branch

Performance benchmarks on Thor show that the knomial algorithm for TL/UCP's allgather in the 1.4.x branch is typically slower (~30%) than in the UCC 1.3 release when running the OSU microbenchmark suite with >1 PPN. The tables below list message size in bytes and the average latency in microseconds reported by osu_allgather, followed by the software stack and command line to reproduce.

2 PPN

| Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
|--------|-------------|-------------|
|1|6.56|8.74|
|2|6.56|8.70|
|4|6.66|8.98|
|8|6.86|9.13|
|16|7.51|9.50|
|32|7.64|10.32|
|64|8.41|11.04|
|128|9.23|12.62|
|256|11.46|15.20|
|512|14.10|21.78|
|1024|20.34|28.89|
|2048|30.84|39.30|
|4096|55.80|58.05|
|8192|91.32|86.52|
|16384|169.40|142.64|
|32768|297.00|251.90|
|65536|579.30|471.23|
|131072|1098.94|910.38|
|262144|2151.55|1834.34|
|524288|4189.40|5307.78|
|1048576|8655.66|11536.47|
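To make the "~30%" figure easier to check, here is a minimal sketch that computes the per-size relative change for the 2 PPN table. The latencies are copied from the table above; the names `LAT_2PPN` and `relative_change` are only illustrative and are not part of UCC or OSU.

```python
# Latencies in microseconds, copied from the 2 PPN table:
# message size (bytes) -> (UCC 1.3, UCC 1.4)
LAT_2PPN = {
    1: (6.56, 8.74),
    2: (6.56, 8.70),
    4: (6.66, 8.98),
    8: (6.86, 9.13),
    16: (7.51, 9.50),
    32: (7.64, 10.32),
    64: (8.41, 11.04),
    128: (9.23, 12.62),
    256: (11.46, 15.20),
    512: (14.10, 21.78),
    1024: (20.34, 28.89),
    2048: (30.84, 39.30),
    4096: (55.80, 58.05),
    8192: (91.32, 86.52),
    16384: (169.40, 142.64),
    32768: (297.00, 251.90),
    65536: (579.30, 471.23),
    131072: (1098.94, 910.38),
    262144: (2151.55, 1834.34),
    524288: (4189.40, 5307.78),
    1048576: (8655.66, 11536.47),
}


def relative_change(old, new):
    """Percent change of `new` relative to `old`; positive means slower."""
    return (new - old) / old * 100.0


# Per-size change of UCC 1.4 relative to UCC 1.3.
changes = {size: relative_change(a, b) for size, (a, b) in LAT_2PPN.items()}

# Sizes where UCC 1.4 is actually faster than UCC 1.3.
faster_in_1_4 = sorted(size for size, pct in changes.items() if pct < 0)

for size, pct in changes.items():
    print(f"{size:>8} {pct:+7.1f}%")
print("1.4 faster at sizes:", faster_in_1_4)
```

On this data the slowdown is around 30% for most small messages, while the 8192 to 262144 byte range is faster with 1.4 at 2 PPN.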

32 PPN

| Size (bytes) | UCC 1.3 (us) | UCC 1.4 (us) |
|--------|-------------|-------------|
|1|19.28|23.80|
|2|15.64|45.93|
|4|58.74|99.40|
|8|65.78|75.84|
|16|69.69|81.49|
|32|63.51|51.74|
|64|81.11|81.08|
|128|132.74|170.85|
|256|178.80|227.84|
|512|315.59|406.94|
|1024|584.30|731.71|
|2048|1166.37|1439.37|
|4096|2654.10|2941.00|
|8192|5090.83|5545.75|
|16384|10090.45|11664.98|
|32768|20186.39|23996.53|
|65536|40567.93|49084.44|
|131072|81698.88|98689.84|
|262144|164481.83|194390.74|
|524288|330808.08|387310.95|

To reproduce:

Software stack: UCX 1.15.x, UCC (1.3.x or 1.4.x branch), OMPI 5.0.x, OSU microbenchmark suite 5.7.1

cmdline: 
`mpirun -np 1024 --map-by node --bind-to core --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_LOG_LEVEL=fatal -x UCC_LOG_LEVEL=fatal -x UCC_CLS=basic -x UCC_TL_UCP_TUNE=allgather:0-inf:@0 ./mpi/collective/osu_allgather`
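For anyone rerunning the command against both branches, a small helper along these lines could turn two captured outputs into a side-by-side comparison. It is a sketch that assumes the usual OSU layout (header lines starting with `#`, then one `size latency` pair per line); `parse_osu` and `compare` are hypothetical names, and the sample strings below are made-up stand-ins for real output that reuse values from the 2 PPN table.

```python
def parse_osu(text):
    """Parse OSU latency output into {message_size_bytes: avg_latency_us}."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and '#' header/comment lines.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        try:
            results[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # ignore any line that is not "size latency"
    return results


def compare(base, new):
    """Return (size, base_us, new_us, new/base) for sizes present in both runs."""
    return [
        (size, base[size], new[size], new[size] / base[size])
        for size in sorted(base.keys() & new.keys())
    ]


# Illustrative stand-ins for captured output from the two builds.
sample_1_3 = "# Size       Avg Latency(us)\n1  6.56\n2  6.56\n4  6.66\n"
sample_1_4 = "# Size       Avg Latency(us)\n1  8.74\n2  8.70\n4  8.98\n"

rows = compare(parse_osu(sample_1_3), parse_osu(sample_1_4))
for size, old, new, ratio in rows:
    print(f"{size:>8} {old:>10.2f} {new:>10.2f} {ratio:>6.2f}x")
```

In practice the two strings would be read from files holding the redirected `mpirun` output of each build.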


TL/UCP Allgather performance in 1.4.x branch #1125
