Description
When running our atmospheric model in parallel, we observe a significant performance drop with OpenMPI compared to Intel MPI + the Intel compiler. Profiling shows that almost all of the extra time is spent inside memset.
Observations
With Intel MPI + the Intel compiler, the code runs at the expected speed.
With OpenMPI (regardless of which compiler is used to build OpenMPI and the model), the same source runs ≥ 2× slower, and the profiler attributes the loss almost entirely to memset.
The problem persists across several recent OpenMPI releases (tested 4.1.x and 5.0.x).
Steps to reproduce
Build the atmospheric model with any compiler (Intel or LLVM) against Intel MPI → run time ≈ T₀ (baseline).
Re-build the identical source against OpenMPI (any recent version) → run time ≈ 2-4× T₀.
Profiling (perf, VTune, or gprof) shows that > 80 % of the extra time is consumed by memset (a minimal reproducer sketch follows these steps).
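
In case a standalone reproducer is useful right away, below is a minimal sketch of the kind of test case we could send (hypothetical, not the model itself): it repeats a nonblocking ring-style exchange on heap-allocated buffers; the real model's communication pattern is of course more complex. Buffer size, iteration count, rank count, and the build/profile commands in the comments are placeholders.

```c
/* Hypothetical minimal reproducer sketch (not the model code).
 *
 * Build / run / profile, e.g.:
 *   mpicc -O2 repro.c -o repro          (mpicc from Intel MPI or OpenMPI)
 *   perf record -g -- mpirun -np 16 ./repro
 *   perf report --stdio | grep -i memset
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NITER 1000            /* placeholder iteration count */
#define NELEM (1 << 20)       /* placeholder message size: 1 Mi doubles (8 MB) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    double *sendbuf = malloc(NELEM * sizeof *sendbuf);
    double *recvbuf = malloc(NELEM * sizeof *recvbuf);
    /* Explicit zero-fill of the send buffer; whether this is the memset
     * the profiler flags, or one inside the MPI library, is what we want
     * to pin down. */
    memset(sendbuf, 0, NELEM * sizeof *sendbuf);

    double t0 = MPI_Wtime();
    for (int it = 0; it < NITER; ++it) {
        MPI_Request req[2];
        MPI_Irecv(recvbuf, NELEM, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, NELEM, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg exchange time: %g s\n", (t1 - t0) / NITER);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```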
Environment
OS: RHEL 7
Compilers tested: Intel 2021.11, LLVM 17
MPIs tested:
– Intel MPI 2021.11
– OpenMPI 4.1.6, 5.0.1 (both built from source and distro packages)
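
Not part of the report above, but for reference: a tiny program we can run to confirm at runtime which MPI library a given build actually linked against (to rule out a mixed-up module or rpath). MPI_Get_library_version is standard MPI-3, so it works with both Intel MPI and OpenMPI.

```c
/* Print which MPI library the executable is actually linked against. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;
    MPI_Get_library_version(version, &len);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI library: %s\n", version);

    MPI_Finalize();
    return 0;
}
```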
Expected behavior
The memset cost should remain small regardless of the MPI implementation, so performance with OpenMPI should match Intel MPI.
Actual behavior
memset becomes the bottleneck under OpenMPI.
Additional notes
No special memset tuning flags are used in either case.
Please let me know if you need any additional information (build options, reproducer, or profiling data).