
Segfault on Cray HPE system #12913

Open
angainor opened this issue Nov 7, 2024 · 1 comment

angainor commented Nov 7, 2024

Hi,

I compiled Open MPI v5.0.5 on LUMI (a Cray HPE SS11 system with AMD CPUs and GPUs). I used the PrgEnv-gnu/8.5.0 environment and configured it as follows:

./configure --prefix=/users/makrotki/software/openmpi5 --with-ofi=/opt/cray/libfabric/1.15.2.0/
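
As a side note, the fi_info utility that ships with libfabric can list which providers the system libfabric exposes; the install path below and the Slingshot provider name "cxi" are assumptions on my part:

# list all providers reported by the Cray libfabric (path assumed)
/opt/cray/libfabric/1.15.2.0/bin/fi_info -l
# show details for the Slingshot provider (provider name "cxi" assumed)
/opt/cray/libfabric/1.15.2.0/bin/fi_info -p cxi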

I ran some OSU benchmarks and generally things look good. Point-to-point tests yield the same performance as Cray MPI. However, I stumbled upon a segfault in MPI_Init. For the run below I allocated only 1 compute node through Slurm and then ran:

~/software/openmpi5/bin/mpirun -np 2 ./osu_barrier
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007955
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007955:08519] *** Process received signal ***
[nid007955:08519] Signal: Segmentation fault (11)
[nid007955:08519] Signal code: Address not mapped (1)
[nid007955:08519] Failing at address: 0x140074656e7a
[nid007955:08519] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14f3d4b66910]
[nid007955:08519] [ 1] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3d0a6)[0x14f3cbe4e0a6]
[nid007955:08519] [ 2] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x3cfeb)[0x14f3cbe4dfeb]
[nid007955:08519] [ 3] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(+0x4d7ba)[0x14f3cbe5e7ba]
[nid007955:08519] [ 4] /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1(fi_fabric+0xa2)[0x14f3cbe2a172]
[nid007955:08519] [ 5] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(+0xa3db4)[0x14f3cbfb4db4]
[nid007955:08519] [ 6] /users/makrotki/software/openmpi5/lib/libopen-pal.so.80(mca_btl_base_select+0x14d)[0x14f3cbfa1ddd]
[nid007955:08519] [ 7] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_r2_component_init+0x12)[0x14f3d503d0c2]
[nid007955:08519] [ 8] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x14f3d503ae54]
[nid007955:08519] [ 9] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x27d34a)[0x14f3d51c634a]
[nid007955:08519] [10] /users/makrotki/software/openmpi5/lib/libmpi.so.40(mca_pml_base_select+0x1ce)[0x14f3d51c287e]
[nid007955:08519] [11] /users/makrotki/software/openmpi5/lib/libmpi.so.40(+0x9a92a)[0x14f3d4fe392a]
[nid007955:08519] [12] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_instance_init+0x61)[0x14f3d4fe4081]
[nid007955:08519] [13] /users/makrotki/software/openmpi5/lib/libmpi.so.40(ompi_mpi_init+0x96)[0x14f3d4fdb8b6]
[nid007955:08519] [14] /users/makrotki/software/openmpi5/lib/libmpi.so.40(MPI_Init+0x5e)[0x14f3d500d46e]
[nid007955:08519] [15] ./osu_barrier[0x40675d]
[nid007955:08519] [16] ./osu_barrier[0x402810]
[nid007955:08519] [17] /lib64/libc.so.6(__libc_start_main+0xef)[0x14f3d498e24d]
[nid007955:08519] [18] ./osu_barrier[0x402d7a]
[nid007955:08519] *** End of error message ***
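
In case it helps with triage, the same run can be repeated with component-selection verbosity turned up to see which libfabric provider the OFI components try to open; mtl_base_verbose and btl_base_verbose are the standard MCA verbosity parameters (output not shown here):

# 2-rank run with MTL/BTL selection verbosity raised
~/software/openmpi5/bin/mpirun -np 2 -mca mtl_base_verbose 100 -mca btl_base_verbose 100 ./osu_barrier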

I tried with 16 ranks and it sometimes works, sometimes segfaults, but with 2 ranks it always segfaults. Note that I always see this message:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007972
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------

regardless of how many ranks I use.

The segfault is gone when I turn off the ofi MTL:

~/software/openmpi5/bin/mpirun -mca mtl ^ofi -np 2 ./osu_barrier

# OSU MPI Barrier Latency Test v7.4
# Avg Latency(us)
             0.21
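
Since the backtrace goes through mca_btl_base_select into fi_fabric, the ofi BTL may also be calling into libfabric; a variant that disables both the OFI MTL and the OFI BTL would be (just a sketch, I have not verified whether it changes anything here):

# disable both the OFI MTL and the OFI BTL (untested variant)
~/software/openmpi5/bin/mpirun -mca mtl ^ofi -mca btl ^ofi -np 2 ./osu_barrier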

Is this a known problem?


angainor commented Nov 8, 2024

I looked around and read the documentation, and found that I should use --prtemca ras_base_launch_orted_on_hn 1, but that did not help:

mpirun --prtemca ras_base_launch_orted_on_hn 1 -np 2 ~/gpubind_pmix.sh ./osu_bibw D D

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: nid007961
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (38)
--------------------------------------------------------------------------
[nid007961:128598:0:128598] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
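
Another untested idea: restricting the OFI MTL to a single provider, in case fi_domain fails while probing an unrelated one. The Slingshot provider name "cxi" and the use of -x to export FI_PROVIDER to the ranks are assumptions on my part:

# pin the OFI layer to the cxi provider (untested sketch)
mpirun -x FI_PROVIDER=cxi -mca mtl_ofi_provider_include cxi -np 2 ./osu_barrier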
