Unexpected low performance with libfabric/CXI and Open MPI #13112

Open
amirshehataornl opened this issue Feb 23, 2025 · 5 comments
Comments

@amirshehataornl
Contributor

amirshehataornl commented Feb 23, 2025

Question
Unexpected low performance with osu_alltoall and ROCm GPU buffers on a Frontier-like system.

I'm seeing the performance below (message size in bytes, average latency in microseconds) with the following environment variables set:

export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_RDZV_EAGER_SIZE=2048
export FI_CXI_OFLOW_BUF_SIZE=12582912
export FI_CXI_OFLOW_BUF_COUNT=3
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_REQ_BUF_MAX_CACHED=0
export FI_CXI_REQ_BUF_MIN_POSTED=6
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MAX_SIZE=-1
export FI_MR_CACHE_MAX_COUNT=524288
mpirun -x FI_OFI_RXM_ENABLE_SHM -x FI_LOG_LEVEL -x FI_CXI_RDZV_THRESHOLD \
    -x FI_CXI_RDZV_EAGER_SIZE -x FI_CXI_OFLOW_BUF_SIZE -x FI_CXI_OFLOW_BUF_COUNT \
    -x FI_CXI_DEFAULT_CQ_SIZE -x FI_CXI_REQ_BUF_MAX_CACHED -x FI_CXI_REQ_BUF_MIN_POSTED \
    -x FI_CXI_REQ_BUF_SIZE -x FI_CXI_RX_MATCH_MODE -x FI_MR_CACHE_MAX_SIZE \
    -x FI_MR_CACHE_MAX_COUNT -x FI_LNX_SRQ_SUPPORT -x FI_SHM_USE_XPMEM -x LD_LIBRARY_PATH \
    --mca mtl_ofi_av table --display mapping,bindings \
    --mca btl '^tcp,ofi,vader,openib' --mca pml '^ucx' --mca mtl ofi \
    --mca opal_common_ofi_provider_include cxi \
    --bind-to core --map-by ppr:1:l3 -np 16 \
    /sw/crusher/ums/ompix/DEVELOP/cce/13.0.0/install/osu-micro-benchmarks-7.5-1//build-ompi/_install/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall \
    -d rocm D D


Size      Avg Latency (us)
1         69.69
2         69.73
4         686.55
8         684.56
16        685.98
32        687.88
64        690.14
128       696.51
256       700.48
512       60.52
1024      61.01
2048      62.00
4096      62.81
8192      5429.99
16384     280.84
32768     100.29
65536     152.72
131072    258.75
262144    509.10
524288    996.54
1048576   2052.14

I'm using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers?

In comparison, below is the performance when using system (host) buffers, which is noticeably better:

Size      Avg Latency (us)
1         40.33
2         40.22
4         38.57
8         38.85
16        39.46
32        43.24
64        43.58
128       47.72
256       45.92
512       41.34
1024      41.57
2048      43.83
4096      46.32
8192      48.40
16384     67.62
32768     93.77
65536     150.25
131072    267.46
262144    583.49
524288    1145.67
1048576   2522.32

This is also using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers? @iziemba

@edgargabriel
Member

edgargabriel commented Feb 24, 2025

@amirshehataornl PR #13006 added support for alltoall (as well as bcast, allgather, and reduce_scatter) operations to the coll/accelerator component. This targets short(er) messages in device buffers. Given that the performance for large message sizes is about the same in your measurements between 5.0 and the main branch, I am pretty sure you are observing the benefits of the new feature, which is expected to be part of the Open MPI 6.0 release.
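
One quick way to confirm that the new component is what you are measuring (a sketch, assuming the usual coll_accelerator_priority MCA parameter is available for this component) would be to rerun the same command with coll/accelerator deprioritized and compare:

# Hypothetical check: deprioritize coll/accelerator so alltoall falls back to the
# previous code path (assumes the coll_accelerator_priority MCA parameter exists).
mpirun --mca coll_accelerator_priority 0 <same arguments as the run above>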

@amirshehataornl
Contributor Author

@edgargabriel, I'm not entirely sure I follow your comment. The second set of performance numbers is for CPU buffers, which look a lot better than the GPU-buffer numbers. Is that to be expected?

@edgargabriel
Member

edgargabriel commented Feb 24, 2025

@amirshehataornl OK, that was not clear from your output, since the PR I referred to also provides significant improvements for alltoall operations on device buffers.

But fundamentally yes, the latency of system/CPU memory communication is much lower than that of device memory communication. This stems from both hardware aspects (HBM latency is higher than DDR latency) and software aspects, i.e. the protocols used internally by communication libraries. For example, GPU IPC handle exchange and attaching to remote GPU memory are expensive and typically only worthwhile beyond a certain message length. I am not 100% sure what libfabric does, but typically the fastest way to handle device-to-device transfers for short messages is to copy the data into system buffers and perform the communication on system buffers. This implies, however, that you will have at least two additional device-host transfers compared to host-host transfers.
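
One way to see which path the provider takes for device memory (a sketch; the exact capability reporting depends on how libfabric was built) is to check whether the CXI provider advertises FI_HMEM and to rerun with verbose logging, which the command above already forwards via -x FI_LOG_LEVEL:

# Does this libfabric build's CXI provider advertise FI_HMEM (device memory) support?
fi_info -p cxi -v | grep -i hmem
# Rerun the benchmark with debug logging to see which transfer path gets chosen.
export FI_LOG_LEVEL=debug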

@amirshehataornl
Contributor Author

amirshehataornl commented Feb 24, 2025

@edgargabriel, thanks for the info. I'll do some more digging on my side. One question, however: is the performance improvement within the realm of the collective algorithms? I.e., should we still see benefits when using libfabric for inter-node communication?

@edgargabriel
Member

By the way, what I said in the previous comment only applies to shared-memory intra-node communication. For inter-node communication, the performance of system and device buffers should be roughly the same on a system that has everything set up correctly.
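
To check the inter-node path in isolation, a run that places one rank per node (a sketch; it reuses the binary from the run above and assumes at least two nodes are allocated) would force all traffic over CXI rather than shared memory:

# Hypothetical isolation run: one rank per node, so every message is inter-node.
mpirun --map-by ppr:1:node --bind-to core --mca pml '^ucx' --mca mtl ofi \
    --mca opal_common_ofi_provider_include cxi -np 2 \
    /sw/crusher/ums/ompix/DEVELOP/cce/13.0.0/install/osu-micro-benchmarks-7.5-1//build-ompi/_install/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall \
    -d rocm D D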
