Unexpected low performance with libfabric/CXI and Open MPI #13112

Open
amirshehataornl opened this issue Feb 23, 2025 · 5 comments
Comments

@amirshehataornl
Contributor

amirshehataornl commented Feb 23, 2025

Question
Unexpected low performance with osu_alltoall and ROCm GPU buffers on a Frontier-like system.

I'm seeing the performance below (message size in bytes, average latency in microseconds) with the following environment variables set:

export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_RDZV_EAGER_SIZE=2048
export FI_CXI_OFLOW_BUF_SIZE=12582912
export FI_CXI_OFLOW_BUF_COUNT=3
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_REQ_BUF_MAX_CACHED=0
export FI_CXI_REQ_BUF_MIN_POSTED=6
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MAX_SIZE=-1
export FI_MR_CACHE_MAX_COUNT=524288
mpirun -x FI_OFI_RXM_ENABLE_SHM -x FI_LOG_LEVEL -x FI_CXI_RDZV_THRESHOLD \
    -x FI_CXI_RDZV_EAGER_SIZE -x FI_CXI_OFLOW_BUF_SIZE -x FI_CXI_OFLOW_BUF_COUNT \
    -x FI_CXI_DEFAULT_CQ_SIZE -x FI_CXI_REQ_BUF_MAX_CACHED -x FI_CXI_REQ_BUF_MIN_POSTED \
    -x FI_CXI_REQ_BUF_SIZE -x FI_CXI_RX_MATCH_MODE -x FI_MR_CACHE_MAX_SIZE \
    -x FI_MR_CACHE_MAX_COUNT -x FI_LNX_SRQ_SUPPORT -x FI_SHM_USE_XPMEM -x LD_LIBRARY_PATH \
    --mca mtl_ofi_av table --display mapping,bindings \
    --mca btl '^tcp,ofi,vader,openib' --mca pml '^ucx' --mca mtl ofi \
    --mca opal_common_ofi_provider_include cxi \
    --bind-to core --map-by ppr:1:l3 -np 16 \
    /sw/crusher/ums/ompix/DEVELOP/cce/13.0.0/install/osu-micro-benchmarks-7.5-1//build-ompi/_install/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall \
    -d rocm D D


Size      Avg Latency (us)
1         69.69
2         69.73
4         686.55
8         684.56
16        685.98
32        687.88
64        690.14
128       696.51
256       700.48
512       60.52
1024      61.01
2048      62.00
4096      62.81
8192      5429.99
16384     280.84
32768     100.29
65536     152.72
131072    258.75
262144    509.10
524288    996.54
1048576   2052.14

I'm using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers?

In comparison, below is the performance when using system (host) buffers, which is noticeably better:

Size      Avg Latency (us)
1         40.33
2         40.22
4         38.57
8         38.85
16        39.46
32        43.24
64        43.58
128       47.72
256       45.92
512       41.34
1024      41.57
2048      43.83
4096      46.32
8192      48.40
16384     67.62
32768     93.77
65536     150.25
131072    267.46
262144    583.49
524288    1145.67
1048576   2522.32

This is also using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers? @iziemba

@edgargabriel
Member

edgargabriel commented Feb 24, 2025

@amirshehataornl PR #13006 added support for alltoall (as well as bcast, allgather, and reduce_scatter) operations to the coll/accelerator component. This targets short(er) messages in device buffers. Given that the performance for large message sizes is about the same in your measurements between 5.0 and the main branch, I am pretty sure you are observing the benefits of the new feature, which is expected to be part of the Open MPI 6.0 release.
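
One quick way to confirm that the new component is what you are measuring (a sketch, assuming the usual coll_accelerator_priority MCA parameter is available for this component) would be to rerun the same command with coll/accelerator deprioritized and compare:

# Hypothetical check: deprioritize coll/accelerator so alltoall falls back to the
# previous code path (assumes the coll_accelerator_priority MCA parameter exists).
mpirun --mca coll_accelerator_priority 0 <same arguments as the run above>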

@amirshehataornl
Contributor Author

@edgargabriel, I'm not entirely sure I follow your comment. The second set of performance numbers is for CPU buffers, which look a lot better than the GPU-buffer numbers. Is that to be expected?

@edgargabriel
Member

edgargabriel commented Feb 24, 2025

@amirshehataornl OK, that was not clear from your output, since the PR I referred to also provides significant improvements for alltoall operations on device buffers.

But fundamentally yes, the latency of system/CPU memory communication is much lower than that of device memory communication. This stems from both hardware aspects (HBM latency is higher than DDR latency) and software aspects, i.e. the protocols used internally by communication libraries. For example, GPU IPC handle exchange and attaching to remote GPU memory are expensive and typically only worthwhile beyond a certain message length. I am not 100% sure what libfabric does, but typically the fastest way to handle device-to-device transfers for short messages is to copy the data into system buffers and perform the communication on system buffers. This implies, however, that you will have at least two additional device-host transfers compared to host-host transfers.
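
One way to see which path the provider takes for device memory (a sketch; the exact capability reporting depends on how libfabric was built) is to check whether the CXI provider advertises FI_HMEM and to rerun with verbose logging, which the command above already forwards via -x FI_LOG_LEVEL:

# Does this libfabric build's CXI provider advertise FI_HMEM (device memory) support?
fi_info -p cxi -v | grep -i hmem
# Rerun the benchmark with debug logging to see which transfer path gets chosen.
export FI_LOG_LEVEL=debug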

@amirshehataornl
Contributor Author

amirshehataornl commented Feb 24, 2025

@edgargabriel, thanks for the info. I'll do some more digging on my side. One question, however: is the performance improvement within the realm of the collective algorithms? I.e., should we still see benefits when using libfabric for inter-node communication?

@edgargabriel
Member

By the way, what I said in the previous comment only applies to shared-memory intra-node communication. For inter-node communication, the performance of system and device buffers should be roughly the same on a system that has everything set up correctly.
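
To check the inter-node path in isolation, a run that places one rank per node (a sketch; it reuses the binary from the run above and assumes at least two nodes are allocated) would force all traffic over CXI rather than shared memory:

# Hypothetical isolation run: one rank per node, so every message is inter-node.
mpirun --map-by ppr:1:node --bind-to core --mca pml '^ucx' --mca mtl ofi \
    --mca opal_common_ofi_provider_include cxi -np 2 \
    /sw/crusher/ums/ompix/DEVELOP/cce/13.0.0/install/osu-micro-benchmarks-7.5-1//build-ompi/_install/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall \
    -d rocm D D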
