Hi,

I compiled Open MPI v5.0.5 on LUMI (a Cray HPE SS11 system with AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured with:

./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
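(Sanity check: the build really does pick up the Cray libfabric rather than some other copy. One way to verify, assuming Open MPI's default component install layout, is to run ldd on the OFI components, e.g.:

ldd ~/software/openmpi5/lib/openmpi/mca_mtl_ofi.so | grep libfabric

The backtrace below also shows /opt/cray/libfabric/1.15.2.0 as the library that actually gets loaded.)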
I ran some OSU benchmarks and things generally look good: point-to-point tests yield the same performance as Cray MPI. However, I stumbled upon a segfault in MPI_Init. Here, I allocated only 1 compute node through Slurm, then:
~/software/openmpi5/bin/mpirun -np 2 ./osu_barrier
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007955
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007955:08519] *** Process received signal ***
[nid007955:08519] Signal: Segmentation fault (11)
[nid007955:08519] Signal code: Address not mapped (1)
[nid007955:08519] Failing at address: 0x140074656e7a
[nid007955:08519] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14f3d4b66910]
[nid007955:08519] [ 1] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3d0a6)[0x14f3cbe4e0a6]
[nid007955:08519] [ 2] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3cfeb)[0x14f3cbe4dfeb]
[nid007955:08519] [ 3] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x4d7ba)[0x14f3cbe5e7ba]
[nid007955:08519] [ 4] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(fi_fabric+0xa2)[0x14f3cbe2a172]
[nid007955:08519] [ 5] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(+0xa3db4)[0x14f3cbfb4db4]
[nid007955:08519] [ 6] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(mca_btl_base_select+0x14d)[0x14f3cbfa1ddd]
[nid007955:08519] [ 7] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x14f3d503d0c2]
[nid007955:08519] [ 8] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x14f3d503ae54]
[nid007955:08519] [ 9] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x27d34a)[0x14f3d51c634a]
[nid007955:08519] [10] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_pml_base_select+0x1ce)[0x14f3d51c287e]
[nid007955:08519] [11] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x9a92a)[0x14f3d4fe392a]
[nid007955:08519] [12] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x14f3d4fe4081]
[nid007955:08519] [13] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x14f3d4fdb8b6]
[nid007955:08519] [14] /users/makrotki/software/openmpi5/lib/libmpi.so.40(MPI_Init+0x5e)[0x14f3d500d46e]
[nid007955:08519] [15] ./osu_barrier[0x40675d]
[nid007955:08519] [16] ./osu_barrier[0x402810]
[nid007955:08519] [17] /lib64/libc.so.6(__libc_start_main+0xef)[0x14f3d498e24d]
[nid007955:08519] [18] ./osu_barrier[0x402d7a]
[nid007955:08519] *** End of error message ***
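For anyone trying to dig into this: libfabric's own debug logging should show which provider's fi_domain call is failing. FI_LOG_LEVEL is a standard libfabric environment variable; I have not captured its output here, but something like this should work:

~/software/openmpi5/bin/mpirun -x FI_LOG_LEVEL=debug -np 2 ./osu_barrier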
I tried with 16 ranks and it sometimes works, sometimes segfaults, but with 2 ranks it always segfaults. Note that I always see this message:
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007972
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
regardless of how many ranks I use.
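Error 38 is ENOSYS ("Function not implemented"), so it looks like the OFI MTL is trying to open a provider whose fi_domain the Cray libfabric does not implement. The stock fi_info utility can list which providers this libfabric actually exposes on the node; the path below assumes the Cray package ships the utility next to the library:

/opt/cray/libfabric/1.15.2.0/bin/fi_info | grep provider | sort -u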
The segfault is gone when I turn off the OFI MTL:
~/software/openmpi5/bin/mpirun -mca mtl ^ofi -np 2 ./osu_barrier
# OSU MPI Barrier Latency Test v7.4
# Avg Latency(us)
0.21
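(If disabling the OFI MTL ends up being my workaround, it can be made the default instead of being passed on every command line; ~/.openmpi/mca-params.conf is the standard per-user MCA parameter file:

echo 'mtl = ^ofi' >> ~/.openmpi/mca-params.conf
)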
Is this a known problem?
So I looked around and read the documentation, and found that I should use --prtemca ras_base_launch_orted_on_hn 1. But that did not help:
mpirun --prtemca ras_base_launch_orted_on_hn 1 -np 2 ~/gpubind_pmix.sh ./osu_bibw D D
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain). This is highly
unusual; your job may behave unpredictably (and/or abort) after this.
Local host: nid007961
Location: mtl_ofi_component.c:1007
Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007961:128598:0:128598] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
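One more thing on my list to try (just an idea from the MCA parameter docs, not verified on this system): restrict the OFI MTL to the Slingshot cxi provider via the standard mtl_ofi_provider_include parameter, so it never calls fi_domain on providers the Cray stack leaves unimplemented:

mpirun --prtemca ras_base_launch_orted_on_hn 1 --mca mtl_ofi_provider_include cxi -np 2 ~/gpubind_pmix.sh ./osu_bibw D D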