mpi4py: Regression with OMPI at branch v5.0.x #12407
Comments
That's on our plate. Haven't figured out the assignee yet.
The failure started on 3/11 after we merged c0f98b8.
So, the submodule bump probably exposed in v5.0.x bugs similar to the ones we have been seeing on main 🤦♂️.
@dalcinl I have problems reproducing the failure. mpi4py points to this setup hook, so I wrote a minimal reproducer
Then I build and run it. Am I reading it wrong?
Your C code looks mostly the same except it is missing
MMM... what happens if you execute the code in between Init/Finalize in a loop, for at least two iterations?
@dalcinl Thanks for your response. I tweaked the reproducer as suggested and it does not fail either
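For reference, the sketch below is my own reconstruction of the shape being suggested, not the actual reproducer from this thread: a single Init/Finalize pair, with intercommunicator creation and destruction executed at least twice in between.

```c
/* My own reconstruction (not the actual reproducer from the thread):
 * keep one Init/Finalize pair and create/destroy an intercommunicator
 * at least twice in between. Needs at least 2 processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (int iter = 0; iter < 2; iter++) {  /* at least two iterations */
        MPI_Comm local, inter;
        int color = rank % 2;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &local);

        /* the remote leader is the lowest world rank in the other group */
        int remote_leader = (color == 0) ? 1 : 0;
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader,
                             /* tag */ 0, &inter);

        MPI_Comm_free(&inter);
        MPI_Comm_free(&local);
    }

    MPI_Finalize();
    return 0;
}
```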
I ran the mpi4py test with
This is on latest v5.0.x.
I can reproduce the mpi4py failure locally. However, if I try
Next, I ran it the following way to EXCLUDE the ULFM tests
and then an intercommunicator test failed in the following, different way:
This may be the same or a similar issue to the one being discussed in #12367. @bosilca nailed down the problem in #12367 (comment), but I have no idea whether anyone is investigating further or working towards a fix.
If all those tests are executing within the same
I'm not sure what you mean by "fair". That way of testing is what everyone else does. Such an approach goes easy on MPI: short-lived tests, a few features at a time. But messes can accumulate out of control and only pop up when users run more complex applications.
mpi4py does almost all of its testing on a single
The MPI library is being reinitialized by every app. It is true that PRRTE is intended to be a persistent DVM and to execute multiple jobs - but PRRTE is not intended as a production runtime. It is for research. OMPI is using it strictly in one-shot mode, which it handles very well. What you do with
I think you may have misunderstood what mpi4py's ULFM tests are about. These tests DO NOT test the failure mode at all; they only test the successful path, just to make sure that the successful path in ULFM works and there are no obvious bugs. Otherwise, what's the point of ULFM if it cannot at least run successfully when there is no failure?
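For context, here is a hedged sketch of what such a "no-failure path" ULFM check might look like. This is illustrative only, not mpi4py's actual test code, and it assumes an Open MPI build that provides the ULFM fault-tolerance extensions (MPIX_Comm_agree / MPIX_Comm_shrink in <mpi-ext.h>).

```c
/* Illustrative only: a "no-failure path" ULFM check.
 * Assumes an Open MPI build with the ULFM extensions available. */
#include <mpi.h>
#include <mpi-ext.h>
#include <assert.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

    /* With no failed processes, agreement should simply succeed. */
    int flag = 1;
    assert(MPIX_Comm_agree(comm, &flag) == MPI_SUCCESS);

    /* Shrinking a healthy communicator should also succeed. */
    MPI_Comm shrunk;
    assert(MPIX_Comm_shrink(comm, &shrunk) == MPI_SUCCESS);

    MPI_Comm_free(&shrunk);
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```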
Let me try to explain again. There is no requirement that it totally clean up at the end of a job, no requirement that it run multiple jobs sequentially.
As I said, PRRTE is a little different in that it is intended to be used as a persistent DVM - which is more in line with what you seem to be expecting. However, while it appears to work pretty well out in the wild, it is still considered a non-production environment. It is entirely possible that it will have problems - and I would definitely expect it to have issues with ULFM, which is itself still very much a research feature. So my point is just that you are using
I'm very well aware of the distinction between mpiexec and the MPI application.
How so? Are you talking SPECIFICALLY about ULFM tests? The issue I'm reporting here is not even about these ULFM tests; disabling them still triggers a failure.
Well, we keep not understanding each other then... I'm not using mpiexec to run multiple jobs. It is always one (parallel) job at a time; I execute
From the point of view of MPI, Python is just an MPI application that calls
PS: When I said above that I was running
@dalcinl This is interesting - is the failure always in
I think we still need a minimal C reproducer for everyone to understand, same as in #12367.
@wenduwan I'm not sure we'll get a minimal C reproducer here -- the issue could well be a complicated one that involves running thousands of tests before running something that actually trips over the failure (but the real failure may have occurred far earlier, or this could be a race condition). @dalcinl Question for you. I am running this manually on my own laptop, and I am seeing what I think might be a different error. There's so much output from the
I used
The stdout and stderr are mixed together; there are inherent buffering race conditions in how the OS delivers stdout and stderr, such that we can't guarantee to order them properly. Is there an easy way to slow down the output from
We figured out my error above (#12416), and now I'm able to get past that test. If I run with np=3 in an Ubuntu VM with 2 cores, I am able to replicate the error on this issue. Will continue digging.
@jsquyres I just found something else: If I run everything the following way
then it fails with
However, if I exclude all of the spawn tests
then success! It runs to completion. Recently, the mpi4py testsuite disabled many spawn tests under Open MPI, but not all of them. We assumed that "low-intensity" tests (those that do not try to launch too many new MPI processes) were fine to run. However, this new regression seems to indicate that is not really the case. For the time being, I'm going to disable all mpi4py spawn tests and forget about it. I will have to close as wontfix/dont-care any user issue or report that involves Open MPI and spawn. I will direct users to contact the Open MPI community for any and all spawn-related issues. Feel free to close this issue at any time. PS: I've been keeping an eye on Open MPI's spawn support for a long time. I put openmpi+mpi4py under automated CI testing exactly three years ago, in March 2021. Over all that time, I've been reporting regressions. Three years later, things keep being broken. Open MPI consistently breaks my tests on patch releases; this regression in v5.0.x is just an instance of what happened over all of the v4 series! Every time there is a regression, users and Linux packagers hit my inbox to report it and ask for help, as if mpi4py had anything to do with it. Not that anyone should care about my feelings, but I feel quite frustrated and I really need to move on.
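For readers unfamiliar with what a "low-intensity" spawn test involves, the following is a hedged sketch of my own (not one of mpi4py's tests): the parent spawns a single child running the same executable, exchanges one message over the parent/child intercommunicator, and disconnects.

```c
/* Illustrative "low-intensity" spawn pattern (my own sketch, not mpi4py's
 * code): parent spawns one copy of this executable, exchanges a single
 * integer over the intercommunicator, then both sides disconnect. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Parent side: spawn a single child process. */
        MPI_Comm child;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
        int answer = 0;
        MPI_Recv(&answer, 1, MPI_INT, 0, 0, child, MPI_STATUS_IGNORE);
        printf("parent received %d from child\n", answer);
        MPI_Comm_disconnect(&child);
    } else {
        /* Child side: report back to the parent and disconnect. */
        int answer = 42;
        MPI_Send(&answer, 1, MPI_INT, 0, 0, parent);
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
```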
It's true that our spawn support has been getting better very slowly over time. We'll keep digging on this one; please don't close it. (Finally) Putting mpi4py on our own CI will help quite a lot.
There is not, and never was, low-intensity or high-intensity stressing here. We stress the hell out of intercommunicator creation in other contexts; the problem seems to pop up only when spawn is involved.
@bosilca I was indeed talking about "high-stress" spawn tests. I'm in a conundrum here. If I keep spawn tests active, they will keep failing and generating confusion with unrelated changes. If I instead skip spawn tests, then we may all just forget about the issue, and things will be irremediably broken 6 months from now. So, what should I do? Keep spawn tests on until a resolution, or (hopefully temporarily) disable them so as not to mess with other developments going on? PS: Despite my frustration and the fact that sometimes I admittedly behave like a...hole as a consequence, I hope you guys understand that my ultimate intention is to improve Open MPI and fix these long-standing issues. Otherwise, I would just come here with vague complaints and sloppy copy & paste console outputs. Instead, I do my best to produce proper bug reports, reproducers, testing infrastructure, and occasional PRs when I manage to understand what I'm doing.
@dalcinl Perhaps you could disable them, but give us a trivial command-line way to enable all the spawn tests, and we can have a PR open here (and/or even enable that command-line option on
And yes, please know that we deeply appreciate all your effort in submitting bug reports to us over the years. You have been cited many, many times in our NEWS for finding issues (and even fixing them). We're not always as fast at fixing issues as I would like, but keeping us honest like this is tremendously important. I was in exactly your role many years ago (in the 90s) when I wrote the MPI C++ bindings test suite. I found bugs in every single MPI implementation, and had to (continually) report to all of them. I feel your pain, and know how much work it takes to do that. All this work is definitely appreciated.
Would something as silly as
@dalcinl I know we (the OMPI developers) all appreciate your efforts in helping us improve OMPI. Keep pushing us, and we'll all get to a better place. The issue in this particular case is our dependence on external libraries that do not necessarily have the same level of support, or the same level of interest in providing proper support for spawn. At the same time, OMPI's lack of a proper testsuite for this topic did not help, and today we are in the situation you are trying to help us fix. I understand what @jsquyres says above, but disabling them is a guarantee that we will remain in the same situation for the foreseeable future.
Sounds perfect to me.
@bosilca and I chatted in Slack: just to clarify what I mean -- I think @dalcinl should disable them by default so that he is unblocked. But we should modify our GitHub Action here to enable them so that we are forced into fixing the issue. Once it's fixed, @dalcinl can choose to re-enable them by default, if desired.
We need to be careful here about separating out what we mean by "spawn". There are two components to that operation:
The reality is that nobody has been willing to take responsibility for the latter code. I've pushed for several years for someone to address it, even pointing out a possible path forward, but no takers. Which is cool - this is a volunteer community and there hasn't been interest in it. I do have a draft PR (#12398) that addresses the issue, and am working on the final part of it to bring it to completion. However, I am not putting that forward as a solution - just an example someone over here can consider.
I don't know what programming model would be satisfied with a spawn without communication between the newly spawned job and the parent job, but I can assure you MPI is not one of them. Thus, MPI cannot support a half-working solution; we need either something that allows the two jobs to exchange their business cards, or we need to figure out a solution on our own.
Sigh - that statement is so far off-base as to defy understanding. The connection info exchange isn't the issue that keeps breaking things - it is the
@rhc54, you are totally right, some statements defy understanding. The only thing that changes is the perspective. And here is my perspective. We literally have tons of tests that check the next_cid code, because this function is THE function used to create communicators in OMPI - all types of communicators: inter and intra, dup, splits, all of them. I find it hard to believe that we never see an issue with next_cid in any other test (and here I literally mean never), but we consistently have issues with this function as soon as we introduce spawn. Please help me understand how you reconcile this factual statement with your comments in all the issues related to spawning?
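To make the contrast concrete, here is a sketch of my own (not an actual OMPI test) of the kind of communicator-creation stress that exercises the CID allocation path heavily without any spawn involved: ordinary constructors such as dup and split, repeated many times.

```c
/* Illustrative sketch: repeated dup/split exercises communicator-ID
 * allocation on a fixed set of processes, with no spawn involved. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1000; i++) {
        MPI_Comm dup, split;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup);
        MPI_Comm_split(dup, rank % 2, rank, &split);
        MPI_Comm_free(&split);
        MPI_Comm_free(&dup);
    }

    MPI_Finalize();
    return 0;
}
```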
George, I don't have to acknowledge that your claims are true, nor do I have to refute them - I only have to point at your own issues over here that state PMIx/PRRTE are not the source of the problem. Go argue with Howard. Go look at MTT, which does not fail in spawn, but rather has one related failure - in MPI_Finalize with a bad comm_free. Get off my back. Rather than throwing rocks at people and their work, I've actually been trying to help Howard fix the damn problem - why don't you?
FYI: v5.0.3 was released yesterday. We're pretty sure it still contains this issue (@wenduwan is going to re-verify that the issue still exists). 😦 @naughtont3 is going to dig into this and try to diagnose the specific issue, and sync up with @hppritcha to see what he has found so far.
I can reproduce the (not exactly the same) failure on v5.0.x: #12460. Saw warnings about
You may get warnings printed out for singleton init runs. These warnings are actually known errors; the MPI call fails with
Fixed; pmix/prrte were updated in the latest v5.0.x.
Looks like an intercommunicator create error has slipped into branch v5.0.x
https://github.com/mpi4py/mpi4py-testing/actions/runs/8241699956/job/22548076522
https://github.com/mpi4py/mpi4py-testing/actions/runs/8257555604/job/22588319271
cc @wenduwan, I see you opened #12403.
PS: Can someone please backport the mpi4py CI tests to branch v5.0.x?