Intermittent crashes inside MPI_Finalize #10117
Comments
We're also seeing this in a Sandia application, where it occurs just frequently enough that at least one job in our CI pipelines fails on just about every run, creating a huge amount of friction for our development. Here's another set of stack traces, this time without GTest in the picture:
As noted in the linked openpmix issue, Open MPI seems to be misusing PMIx in ways that cause it to fail. Open MPI 4.1.5 was built from Spack as follows:
It might be worth checking whether this occurs at the head of the OMPI main or v5 branches. I think they addressed the problems I cited there, but those fixes were not brought back into the v4 releases.
We tried 5.0.0rc10 and ended up facing at least two different hangs and crashes. Those are being reported.
…nce against RTE/OPAL
- Append PML cleanup to the finalize of the instance domain ('ompi_instance_common_domain') before RTE/OPAL init.
- The reason: RTE init (ompi_rte_init()) calls opal_init(), which in turn sets the internal tracking domain to OPAL's domain ('opal_init_domain'), so this PML cleanup function would be mis-registered as belonging to 'opal_init_domain' instead of the current 'ompi_instance_common_domain'.
- The consequence of this mis-registration: at MPI_Finalize(), this PML cleanup (*_del_procs()) is executed by RTE and, depending on the registration order, may pull the rug out from under other still-running components (*_progress()).
- This may be the root cause of issue open-mpi#10117
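To make the mis-registration mechanism concrete, here is a minimal, self-contained C sketch of the domain-tracking pattern the commit message describes. The names (domain_t, register_cleanup, fake_rte_init, and so on) are hypothetical stand-ins for illustration, not Open MPI's actual internal API:

```c
/*
 * Illustration only: the names here (domain_t, register_cleanup,
 * fake_rte_init, ...) are hypothetical stand-ins that mimic the
 * domain-tracking pattern described in the commit message above;
 * they are NOT Open MPI's actual internal API.
 */
#include <stdio.h>

#define MAX_CB 8

typedef struct {
    const char *name;
    void (*callbacks[MAX_CB])(void);
    int count;
} domain_t;

static domain_t instance_domain = { "ompi_instance_common_domain", {0}, 0 };
static domain_t opal_domain     = { "opal_init_domain",            {0}, 0 };

/* Cleanups get appended to whichever domain is "current" when registered. */
static domain_t *current_domain = &instance_domain;

static void register_cleanup(void (*cb)(void)) {
    current_domain->callbacks[current_domain->count++] = cb;
}

static void fake_rte_init(void) {
    /* Like ompi_rte_init() -> opal_init(): switches the tracking domain. */
    current_domain = &opal_domain;
}

static void pml_del_procs(void) { puts("  PML del_procs cleanup runs"); }

static void run_domain(const domain_t *d) {
    printf("finalizing %s\n", d->name);
    for (int i = d->count - 1; i >= 0; i--) d->callbacks[i]();
}

int main(void) {
    /* Buggy order: init first, then register. The cleanup lands in
     * opal_init_domain, so it fires during RTE/OPAL teardown rather
     * than with the instance domain. */
    fake_rte_init();
    register_cleanup(pml_del_procs);

    /* MPI_Finalize-like teardown: instance domain first, then OPAL. */
    run_domain(&instance_domain);  /* empty: the cleanup is not here */
    run_domain(&opal_domain);      /* del_procs runs here, too late  */
    return 0;
}
```

Under this toy model, the fix described in the commit amounts to calling register_cleanup(pml_del_procs) before fake_rte_init(), so the cleanup stays in 'ompi_instance_common_domain' and runs with the instance finalize rather than during RTE/OPAL teardown.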
Hi @sethrj and @PhilMiller, I encountered a problem similar to this issue while testing another feature in OMPI, and found a potential mis-ordering in the finalize sequence. Since I don't see a reproduction test in your report, could you check on your side whether PR #13073 brings any stability to your applications? Feel free to discuss.
@lifflander could you try to pass this along to whoever is managing the jobs where this arose? I think it was EMPIRE's CI environment when they updated Trilinos.
Background information
This issue was moved from openpmix/openpmix#2508. Our CI experiences roughly a 1-in-1000 chance of crashing or hanging during a normal call to MPI_Finalize at the end of our unit tests (even tests that are launched with only a single processor).
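As a point of reference, a stress harness that could surface this kind of intermittent MPI_Finalize failure can be as small as the sketch below (the file name, loop count, and mpirun invocation in the comment are illustrative assumptions, not taken from the actual test suite):

```c
/*
 * finalize_stress.c - trivial MPI program; the intermittent failure, if it
 * shows up at all, is expected inside MPI_Finalize(), so the body does no
 * real work.
 *
 * Illustrative build/run loop (names and counts are assumptions):
 *   mpicc finalize_stress.c -o finalize_stress
 *   for i in $(seq 1 1000); do mpirun -np 1 ./finalize_stress || break; done
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("no work done; heading straight to MPI_Finalize\n");
    }

    /* Roughly 1 run in 1000 crashed or hung here in the reporter's CI. */
    MPI_Finalize();
    return 0;
}
```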
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
4.1.2 presently, but I've seen this with a couple of different Open MPI 4.x versions, with internal and external PMIx, both installed with Spack. Here are two configurations (differing primarily in OS) that fail:
and (built-in PMIx)
I also got a failure on an even older Open MPI (2.1.6):
Please describe the system on which you are running
Details of the problem
Here are stack traces from two failure modes that crash instead of hanging. The first is inside PMIx for Open MPI 4.1.2:
The second is a failure in OpenMPI 2.1.6 itself:
Any help would be greatly appreciated; these errors have been driving us crazy, since a 1/1000 failure rate means we never get a fully successful CI pipeline.