Intermittent crashes inside MPI_Finalize #10117
Comments
We're also seeing this in a Sandia application, where it occurs just frequently enough that at least one job in our CI pipelines fails on just about every run, creating a huge amount of friction for our development. Here's another set of stack traces, this time without GTest in the picture:
As noted in the linked openpmix issue, Open MPI seems to be misusing PMIx in ways that cause it to fail. Open MPI 4.1.5 was built from Spack as follows:
It might be worth checking whether this occurs at the head of the OMPI main or v5 branches. I think they addressed the problems I cited there, but those fixes were not brought back into the v4 releases.
We tried 5.0.0rc10 and ended up facing at least two different hangs and crashes. Those are being reported.
…nce against RTE/OPAL
- Append PML cleanup to the finalize of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.
- The reason: RTE init (ompi_rte_init()) calls opal_init(), which in turn sets the internal tracking domain to OPAL's domain ('opal_init_domain'), so this PML cleanup function would be mis-registered as belonging to 'opal_init_domain' instead of the current 'ompi_instance_common_domain'.
- The consequence of this mis-registration: at MPI_Finalize(), this PML cleanup (*_del_procs()) is executed by RTE and, depending on the registration order, may pull the rug out from under other still-running components (*_progress()).
- This may be the root cause of issue open-mpi#10117
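To make the mis-registration mechanism concrete, here is a minimal, self-contained C sketch of the domain-tracking pattern the commit message describes. The names (domain_t, register_cleanup, fake_rte_init, and so on) are hypothetical stand-ins for illustration, not Open MPI's actual internal API:

```c
/*
 * Illustration only: the names here (domain_t, register_cleanup,
 * fake_rte_init, ...) are hypothetical stand-ins that mimic the
 * domain-tracking pattern described in the commit message above;
 * they are NOT Open MPI's actual internal API.
 */
#include <stdio.h>

#define MAX_CB 8

typedef struct {
    const char *name;
    void (*callbacks[MAX_CB])(void);
    int count;
} domain_t;

static domain_t instance_domain = { "ompi_instance_common_domain", {0}, 0 };
static domain_t opal_domain     = { "opal_init_domain",            {0}, 0 };

/* Cleanups get appended to whichever domain is "current" when registered. */
static domain_t *current_domain = &instance_domain;

static void register_cleanup(void (*cb)(void)) {
    current_domain->callbacks[current_domain->count++] = cb;
}

static void fake_rte_init(void) {
    /* Like ompi_rte_init() -> opal_init(): switches the tracking domain. */
    current_domain = &opal_domain;
}

static void pml_del_procs(void) { puts("  PML del_procs cleanup runs"); }

static void run_domain(const domain_t *d) {
    printf("finalizing %s\n", d->name);
    for (int i = d->count - 1; i >= 0; i--) d->callbacks[i]();
}

int main(void) {
    /* Buggy order: init first, then register. The cleanup lands in
     * opal_init_domain, so it fires during RTE/OPAL teardown rather
     * than with the instance domain. */
    fake_rte_init();
    register_cleanup(pml_del_procs);

    /* MPI_Finalize-like teardown: instance domain first, then OPAL. */
    run_domain(&instance_domain);  /* empty: the cleanup is not here */
    run_domain(&opal_domain);      /* del_procs runs here, too late  */
    return 0;
}
```

Under this toy model, the fix described in the commit amounts to calling register_cleanup(pml_del_procs) before fake_rte_init(), so the cleanup stays in 'ompi_instance_common_domain' and runs with the instance finalize rather than during RTE/OPAL teardown.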
Hi @sethrj and @PhilMiller, I encountered a problem similar to this issue while testing another feature in OMPI, and found a potential mis-ordering in the finalize sequence. Since I don't see a reproduction test in your report, could you check on your side whether PR #13073 brings any stability to your applications? Feel free to discuss.
@lifflander could you try to pass this along to whoever is managing the jobs where this arose? I think it was EMPIRE's CI environment when they updated Trilinos.
Background information
This issue was moved from openpmix/openpmix#2508. Our CI experiences roughly a 1-in-1000 chance of crashing or hanging during a normal call to MPI_Finalize at the end of our unit tests (even tests that are launched with only a single processor).
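As a point of reference, a stress harness that could surface this kind of intermittent MPI_Finalize failure can be as small as the sketch below (the file name, loop count, and mpirun invocation in the comment are illustrative assumptions, not taken from the actual test suite):

```c
/*
 * finalize_stress.c - trivial MPI program; the intermittent failure, if it
 * shows up at all, is expected inside MPI_Finalize(), so the body does no
 * real work.
 *
 * Illustrative build/run loop (names and counts are assumptions):
 *   mpicc finalize_stress.c -o finalize_stress
 *   for i in $(seq 1 1000); do mpirun -np 1 ./finalize_stress || break; done
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("no work done; heading straight to MPI_Finalize\n");
    }

    /* Roughly 1 run in 1000 crashed or hung here in the reporter's CI. */
    MPI_Finalize();
    return 0;
}
```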
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2 presently, but I've seen this with a couple of different Open MPI 4.x versions, with internal and external PMIx, both installed with Spack. Here are two configurations (differing primarily in OS) that fail:
and (built-in PMIx)
I also got a failure on an even older Open MPI (2.1.6):
Please describe the system on which you are running
Details of the problem
Here are stack traces from two failure modes that crash instead of hanging. The first is inside PMIx for Open MPI 4.1.2:
The second is a failure in OpenMPI 2.1.6 itself:
Any help would be greatly appreciated; these errors have been driving us crazy, since a 1/1000 failure rate means we never get a fully successful CI pipeline.