Unexpectedly low performance with libfabric/CXI and Open MPI #13112
Comments
@amirshehataornl PR #13006 added support for alltoall (and bcast, allgather, reduce_scatter) operations to the coll/accelerator component. It targets short(er) messages in device buffers. Given that the performance for large message sizes is about the same between 5.0 and the main branch in your measurements, I am pretty sure that you are observing the benefits of this new feature, which is expected to be part of the 6.0 release of Open MPI.
@edgargabriel, I'm not entirely sure I follow your comment. The second set of performance numbers is for CPU buffers, which looks a lot better than the GPU buffers. Is that to be expected?
@amirshehataornl Ok, that was not clear from your output, since the PR I referred to also provides significant improvements for alltoall operations on device buffers. But fundamentally, yes, the latency of system/CPU memory communication is much lower than that of device memory communication. This stems from both hardware aspects (HBM latency is higher than DDR latency) and software aspects, i.e. the protocols used internally by communication libraries. For example, GPU-IPC handle exchange and attaching to remote GPU memory is expensive and typically only worthwhile starting from a certain message length. I am not 100% sure what libfabric does, but typically the fastest way to handle device-to-device transfers for short messages is to copy the data into system buffers and perform the communication on system buffers. This implies, however, that you incur at least two additional device-host transfers compared to host-to-host transfers.
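To make the staging idea concrete, here is a minimal, hedged sketch (not Open MPI's or libfabric's actual code path) of pushing a short device-resident message through host bounce buffers; it shows where the two extra device-host copies come from. The helper names `staged_send`/`staged_recv`, buffer handling, and sizes are purely illustrative, and a ROCm/HIP environment plus an MPI library are assumed.

```c
/* Illustrative sketch of a host-staging ("copy-in/copy-out") protocol for
 * short device buffers. Not the actual protocol used by Open MPI or libfabric. */
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <stdlib.h>

void staged_send(const void *d_buf, size_t len, int dst, MPI_Comm comm)
{
    void *h_buf = malloc(len);
    hipMemcpy(h_buf, d_buf, len, hipMemcpyDeviceToHost);  /* extra D2H copy */
    MPI_Send(h_buf, (int)len, MPI_BYTE, dst, 0, comm);    /* host-to-host path */
    free(h_buf);
}

void staged_recv(void *d_buf, size_t len, int src, MPI_Comm comm)
{
    void *h_buf = malloc(len);
    MPI_Recv(h_buf, (int)len, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    hipMemcpy(d_buf, h_buf, len, hipMemcpyHostToDevice);  /* extra H2D copy */
    free(h_buf);
}
```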
@edgargabriel, thanks for the info. I'll do some more digging on my side. One question, though: is the performance improvement confined to the collective algorithms? I.e., should we still see benefits when using libfabric for inter-node communication?
By the way, what I said in the previous comment only holds for shared-memory intra-node communication. For inter-node communication, the performance of system and device buffers should be the same on a correctly configured system.
Question
Unexpectedly low performance with osu_alltoall and ROCm GPU buffers on a Frontier-like system.
I'm seeing the performance below with the following environment variables set:
I'm using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers?
For comparison, below is the performance when using system buffers, which is clearly better.
cc @iziemba
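For reference, a minimal sketch of what an osu_alltoall-style measurement on ROCm device buffers boils down to is shown below. It assumes a GPU-aware (ROCm-enabled) Open MPI build so that device pointers can be passed directly to MPI_Alltoall; the message size and iteration count are illustrative, not the benchmark's actual parameters.

```c
/* Minimal sketch of an alltoall timing loop on device buffers, assuming a
 * GPU-aware MPI that accepts ROCm device pointers. Illustrative only. */
#include <mpi.h>
#include <hip/hip_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const size_t msg = 1024;   /* bytes sent to each peer (illustrative) */
    const int iters = 100;
    void *d_send, *d_recv;
    hipMalloc(&d_send, msg * size);
    hipMalloc(&d_recv, msg * size);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(d_send, (int)msg, MPI_BYTE,
                     d_recv, (int)msg, MPI_BYTE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    printf("avg alltoall time: %f us\n", (t1 - t0) / iters * 1e6);

    hipFree(d_send);
    hipFree(d_recv);
    MPI_Finalize();
    return 0;
}
```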