Commit 3e871f0
Adding the circulant graph queued variable ring algorithm for Bcast.
This algorithm achieves better performance than existing algorithms for both small and large message sizes.
The algorithm is based on the circulant graph abstraction and Jesper Larsson Traff's recent paper: https://dl.acm.org/doi/full/10.1145/3735139.
It creates communication schedules around various rings in the circulant graph, then repeats the schedule to pipeline message chunks.
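As a rough illustration of the circulant graph abstraction (not the code in this commit), the sketch below builds a list of skips by repeatedly halving the process count and rounding up, then computes each rank's send and receive neighbors on the ring defined by one skip. The halving construction and the names `circulant_skips`, `ring_neighbors`, and `simulate_reach` are assumptions made for this example; the actual schedule computation in the commit and the paper is more involved.

```python
def circulant_skips(p):
    """Return skips for p processes, built by repeated ceil-halving of p.

    Assumed construction for illustration: p, ceil(p/2), ..., 2, 1,
    returned in increasing order without the trailing p.
    """
    skips = [p]
    while skips[-1] > 1:
        s = skips[-1]
        skips.append(s - s // 2)  # equals ceil(s / 2)
    skips.reverse()
    return skips[:-1]


def ring_neighbors(rank, p, skip):
    """Return (to, from) neighbors of rank on the ring with the given skip."""
    return (rank + skip) % p, (rank - skip) % p


def simulate_reach(p):
    """Simulate one pass over the skips starting from rank 0.

    In each round every informed rank forwards along the current skip;
    the result is the set of ranks informed after all rounds.
    """
    informed = {0}
    for skip in circulant_skips(p):
        informed |= {ring_neighbors(r, p, skip)[0] for r in informed}
    return informed


if __name__ == "__main__":
    print(circulant_skips(10))
    print(ring_neighbors(0, 10, 3))
    print(sorted(simulate_reach(10)))
```

With these skips, one pass over all rounds reaches every rank from rank 0, which is the property a broadcast schedule needs before the schedule is repeated for successive chunks.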
We introduce a FIFO queue for overlapping sends and receives across communication rounds, which particularly benefits small messages.
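The general pattern of a bounded FIFO of in-flight operations can be sketched as below. This is a simulation under stated assumptions, not the commit's implementation: `chunk_ranges` and `pipeline_with_queue` are hypothetical names, "post" stands in for starting a nonblocking operation (for example `MPI_Isend` or `MPI_Irecv`), and "wait" stands in for completing the oldest one (for example `MPI_Wait`).

```python
from collections import deque


def chunk_ranges(nbytes, chunk_size):
    """Split nbytes into (offset, length) chunks; the last may be short."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(off, min(chunk_size, nbytes - off))
            for off in range(0, nbytes, chunk_size)]


def pipeline_with_queue(num_ops, queue_len):
    """Post num_ops operations through a FIFO holding at most queue_len.

    Returns the event log and the peak number of operations in flight.
    When the queue is full, the oldest operation is completed first.
    """
    if queue_len < 1:
        raise ValueError("queue_len must be at least 1")
    in_flight = deque()
    events = []
    peak = 0
    for op in range(num_ops):
        if len(in_flight) == queue_len:
            events.append(("wait", in_flight.popleft()))
        in_flight.append(op)
        events.append(("post", op))
        peak = max(peak, len(in_flight))
    while in_flight:  # drain whatever is still outstanding
        events.append(("wait", in_flight.popleft()))
    return events, peak


if __name__ == "__main__":
    # Assumes "256k" means 256 KiB; the commit message does not say.
    chunks = chunk_ranges(1 << 20, 256 * 1024)
    events, peak = pipeline_with_queue(len(chunks), 2)
    print(len(chunks), peak)
```

A longer queue lets more operations from later rounds be posted before earlier ones complete, which is where the overlap comes from; a queue length of 1 degenerates to posting and waiting one operation at a time.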
In the graph below, we show the algorithm's performance with a fixed chunk size (256k) and queue length (24) at various scales on ANL Aurora, each labeled as (N, PPN).
The baseline for this graph is the best-performing algorithm currently in MPICH, so all speedups represent improvements over all algorithms currently in the library.
We note that performance drops for message sizes near our selected chunk size (256k).
By tuning the chunk size near this message size, it is possible to achieve a speedup across all message sizes for all scales.
1 parent 7fcdc20 commit 3e871f0
File tree
7 files changed, +463 -1 lines changed
- src/mpi/coll
  - bcast
  - include
  - src
- test/mpi/maint