With much effort, we've fixed problems in Open MPI and PRRTe/PMIx so that, at least on one node, all but one of our ibm/dynamic tests run.
The last one, which hangs, is the no-disconnect test. This test spawns a tree of processes: the grandparent spawns parents, then the parents spawn grandchildren.
The problem is those parents! They hang around and have to be killed off manually.
Why is this, you may ask?
The reason lies in this code:
int ompi_dpm_dyn_finalize(void)
{
    int i, j = 0, max = 0, num_dyns = 0;
    ompi_dpm_disconnect_obj **objs = NULL;
    ompi_communicator_t *comm = NULL;

    fprintf(stderr, "process %d had %d dyncomms to disconnect\n", getpid(), ompi_comm_num_dyncomm);
    if (1 < ompi_comm_num_dyncomm) {
        objs = (ompi_dpm_disconnect_obj**)malloc(ompi_comm_num_dyncomm *
                                                 sizeof(ompi_dpm_disconnect_obj*));
        if (NULL == objs) {
            return OMPI_ERR_OUT_OF_RESOURCE;
        }

        max = ompi_comm_get_num_communicators();
        for (i = 3; i < max; i++) {
            comm = ompi_comm_lookup(i);
            if (NULL != comm && OMPI_COMM_IS_DYNAMIC(comm)) {
                num_dyns++;
                if (comm->c_name != NULL) {
                    fprintf(stderr, "process %d parent comm %d being counted as dynmaic %s\n", getpid(), i, comm->c_name);
                }
            }
        }
        fprintf(stderr, "process %d had %d dyncomms to disconnect num dyn comms %d\n", getpid(), ompi_comm_num_dyncomm, num_dyns);

        for (i = 3; i < max; i++) {
            comm = ompi_comm_lookup(i);
            if (NULL != comm && OMPI_COMM_IS_DYNAMIC(comm)) {
                objs[j++] = disconnect_init(comm);
            }
        }

        if (j != ompi_comm_num_dyncomm) {
            cleanup_dpm_disconnect_objs(objs, j);
            return OMPI_ERROR;
        }

        disconnect_waitall(ompi_comm_num_dyncomm, objs);
        cleanup_dpm_disconnect_objs(objs, ompi_comm_num_dyncomm);
    }

    return OMPI_SUCCESS;
}
The grandparent and the grandchildren each have only a single dynamic intercomm, while the parents have 3!
The grandparent and the grandchildren zoom past this sync-up operation because the check 1 < ompi_comm_num_dyncomm is false for them: their only dynamic communicator is the one set up between spawner and spawned by the spawn itself. The parents do enter the sync, and they hang there.
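To make the mismatch concrete, here is a toy model of that guard in plain C. It is only a sketch, not OMPI code: the names enters_finalize_sync, would_hang and dyncomm_count are hypothetical, the per-role counts (1, 3, 1) are the ones reported above, and "hang" is simplified to "this process enters the sync while at least one other role skips it".

```c
#include <stdbool.h>

/* Hypothetical stand-in for the guard in ompi_dpm_dyn_finalize():
 * a process only enters the disconnect sync when it counts more
 * than one dynamic communicator. */
bool enters_finalize_sync(int num_dyncomm)
{
    return 1 < num_dyncomm;
}

/* Roles in the no-disconnect test's process tree. */
enum role { GRANDPARENT, PARENT, GRANDCHILD, NUM_ROLES };

/* Dynamic communicator counts as reported in this issue (assumed here). */
const int dyncomm_count[NUM_ROLES] = { 1, 3, 1 };

/* Simplified model: a process blocks if it enters the sync while
 * at least one of the other roles skips it. */
bool would_hang(enum role r)
{
    if (!enters_finalize_sync(dyncomm_count[r])) {
        return false; /* skipped the sync entirely, so it exits */
    }
    for (int p = 0; p < NUM_ROLES; p++) {
        if (p != (int)r && !enters_finalize_sync(dyncomm_count[p])) {
            return true; /* waiting on a peer that never participates */
        }
    }
    return false;
}
```

In this model would_hang(PARENT) is true while the grandparent and grandchild roles return false, which matches what the test shows: only the parents are left behind.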
It appears this routine has had this structure since the dawn of the OMPI era. Notice the magic numbers, like the '3' in the loop over dyn comms.
Anyway, this points to a basic problem with the way OMPI handles cleaning up dynamic communicators itself within MPI_Finalize. If the app does the disconnecting itself, this hang is not observed.
There is no magic in the 3: the loop is skipping the 3 predefined communicators in the world model, which are statically registered under comm_id 0, 1 and 2.