-
tag @sonjahapp
-
Thanks @hzhou for this summary. I think we need to understand whether it is an advantage for process managers and the MPI library in general (independent of the PM interface) if multiple inits and finalizes are done for different MPI Sessions. I think the questions are:
In my opinion, the answer to both questions is currently: No, not really, because to my knowledge no PM has a representation of something like an MPI Session (well... PMIx at least counts the inits, but that's it). For the moment, I would vote for initializing and finalizing the connection to the PM only once, at the beginning and at the end, independent of the PM. From there, we could move forward with multiple init/finalize in the future, as PMs, their interfaces, and their capability to represent and manage individual MPI Sessions evolve.
-
The session re-init introduces the situation of multiple `MPI_Init/MPI_Finalize` windows. We conventionally call `PMI_Init` in `MPI_Init` and call `PMI_Finalize` in `MPI_Finalize`. However, `PMI_Finalize` will close the PMI connection to the server. For many implementations, hydra included, this PMI connection is established at launch time, and once closed, it is difficult to re-establish. PR #6534 moved the `PMI_Finalize` call to an atexit handler, so we only close the connection at the end. However, we still call `PMI_Init` multiple times, and some implementations cannot handle mismatched `PMI_Init/PMI_Finalize` calls. PR #6564 fixes this for `PMIx`. But this issue may be common to `PMIv1` and `PMIv2` as well. There is a report that Slurm's PMI2 implementation has the same issue. The fix may be simple for a particular implementation, but it requires discussion in general: `PMI_Init/PMI_Finalize`, `atexit` handler.

Thus, I think the right solution is to extend the PMI API to differentiate the two different `init/finalize` functionalities.

Alternatively, the implementation can hide this and handle it internally. This will require applications to always call `PMI_Init/PMI_Finalize` each time, and the server should internally make sure the launch-time setup does not get destroyed. Most of this setup will get reaped at process exit time anyway, and the server just needs to make sure not to raise false alarms. Ref 6a103ee

Unfortunately, it will take time for all implementations to update their fix. In the meantime, we have to deal with it on a case-by-case basis.
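To make the "connect once, close at exit" pattern concrete, here is a minimal sketch in C. It is not the MPICH code from the PRs above: `pm_connect`, `pm_disconnect`, `session_window_init`, `session_window_finalize`, and `run_windows` are hypothetical names that stand in for the real PMI calls and the MPI init/finalize windows. The only assumption is that the client keeps a flag and skips the connect step when it is already set.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the PMI client calls; not a real PMI API.
 * The counters only exist so the behavior can be observed. */
static int pm_connect_calls = 0;
static int pm_disconnect_calls = 0;

static int pm_connect(void)     { pm_connect_calls++; return 0; }
static void pm_disconnect(void) { pm_disconnect_calls++; }

/* Nonzero while the launch-time connection is open. */
static int pm_connected = 0;

/* Runs once at process exit and closes the connection if it is open. */
static void pm_atexit_handler(void)
{
    if (pm_connected) {
        pm_disconnect();
        pm_connected = 0;
    }
}

/* Called at the start of every init/finalize window.
 * Only the first call connects and registers the exit handler. */
static int session_window_init(void)
{
    if (!pm_connected) {
        if (pm_connect() != 0)
            return -1;
        pm_connected = 1;
        atexit(pm_atexit_handler);
    }
    return 0;
}

/* Called at the end of every window. It deliberately leaves the
 * connection open, because it may be hard to re-establish. */
static void session_window_finalize(void)
{
}

/* Runs n back-to-back init/finalize windows in one process. */
static void run_windows(int n)
{
    for (int i = 0; i < n; i++) {
        int rc = session_window_init();
        assert(rc == 0 && pm_connected);
        session_window_finalize();
    }
}
```

With this guard, calling `run_windows(3)` connects once and does not disconnect until the process exits. This sidesteps the mismatched init/finalize problem on the client side, but it does not help a process manager that wants to observe individual windows.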
Reference: