tag @sonjahapp
Thanks @hzhou for this summary. I think we need to understand whether it is an advantage for process managers, and for the MPI library in general (independent of the PM interface), if multiple inits and finalizes are done for different MPI Sessions. I think the questions are:
In my opinion, the answer to both questions is currently: No, not really, because to my knowledge no PM has a representation of anything like an MPI Session (well... PMIx at least counts the inits, but that's it). For the moment, I would vote for initializing and finalizing the connection to the PM only once, at the beginning and at the end, independent of the PM. From there, we could move toward multiple init/finalize in the future, as PMs, their interfaces, and their capability to represent/manage individual MPI Sessions evolve.
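A minimal sketch of the "connect once, disconnect once" idea, assuming a client that guards the PM connection with a flag and defers the disconnect to an `atexit` handler. All names here (`pm_connect_once`, `session_init`, and so on) are hypothetical and stand in for whatever PM client calls an implementation actually uses; the counters exist only to make the behavior observable.

```c
#include <stdlib.h>

/* Hypothetical stand-ins: a real client would call the PM's own
 * init/finalize routines where the counters are incremented. */
static int pm_connect_calls = 0;
static int pm_disconnect_calls = 0;
static int pm_connected = 0;

/* Runs at process exit; closes the PM connection at most once. */
static void pm_disconnect_once(void)
{
    if (pm_connected) {
        pm_disconnect_calls++;
        pm_connected = 0;
    }
}

/* Connects on the first call only; later calls are no-ops. */
static int pm_connect_once(void)
{
    if (pm_connected)
        return 0;
    pm_connect_calls++;
    pm_connected = 1;
    /* atexit returns 0 on success, nonzero if registration fails */
    return atexit(pm_disconnect_once);
}

/* Hypothetical per-session hooks: init ensures the connection exists,
 * finalize deliberately leaves it open for a later session. */
static int session_init(void)
{
    return pm_connect_once();
}

static void session_finalize(void)
{
    /* intentionally empty: the connection is closed at exit */
}
```

Calling `session_init()` and `session_finalize()` repeatedly should leave `pm_connect_calls` at 1 and `pm_disconnect_calls` at 0 until the process exits.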
The session re-init introduces the situation of multiple `MPI_Init/MPI_Finalize` windows. We conventionally call `PMI_Init` in `MPI_Init` and `PMI_Finalize` in `MPI_Finalize`. However, `PMI_Finalize` will close the PMI connection to the server. For many implementations, hydra included, this PMI connection is established at launch time, and once closed, it is difficult to re-establish. PR #6534 moved the `PMI_Finalize` call to an `atexit` handler, so we only close the connection at the end. However, we still call `PMI_Init` multiple times, and some implementations cannot handle mismatched `PMI_Init/PMI_Finalize`. PR #6564 fixes this for `PMIx`, but the issue may be common to `PMIv1` and `PMIv2` as well. There is a report that Slurm's PMI2 implementation has the same issue. The fix may be simple for a particular implementation, but it requires discussion in general -- `PMI_Init/PMI_Finalize` `atexit` handler.

Thus, I think the right solution is to extend the PMI API to differentiate the two different `init/finalize` functionalities.

Alternatively, the implementation can hide this and handle it internally. This will require applications to always call `PMI_Init/PMI_Finalize` each time, and the server should internally make sure the launch-time setup does not get destroyed. Most of this setup will get reaped at process exit time anyway, and the server just needs to make sure not to raise false alarms. Ref 6a103ee

Unfortunately, it will take time for all implementations to roll out their fixes. In the meantime, we have to deal with it on a case-by-case basis.
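To make the proposed split concrete, here is a rough sketch of what separating the two `init/finalize` roles could look like: one pair owns the launch-time connection, the other only opens and closes an init window. Every name below (`struct pm_client`, `pm_connect`, `pm_window_begin`, `pm_window_end`, `pm_disconnect`) is hypothetical and not part of PMI-1, PMI-2, or PMIx; it only illustrates the separation of concerns.

```c
#include <stdbool.h>

/* Hypothetical client-side state, not taken from any PMI interface. */
struct pm_client {
    bool connected;     /* launch-time connection to the server */
    int open_windows;   /* init windows currently open */
    int windows_seen;   /* total init windows opened so far */
};

/* Launch-time half: establish the connection (idempotent here). */
static int pm_connect(struct pm_client *c)
{
    c->connected = true;
    return 0;
}

/* Per-window half: needs a live connection but never modifies it. */
static int pm_window_begin(struct pm_client *c)
{
    if (!c->connected)
        return -1;
    c->open_windows++;
    c->windows_seen++;
    return 0;
}

/* A mismatched end is reported as an error, and the connection
 * is left untouched either way. */
static int pm_window_end(struct pm_client *c)
{
    if (c->open_windows == 0)
        return -1;
    c->open_windows--;
    return 0;
}

/* Exit-time half: the only call that closes the connection;
 * it refuses while a window is still open. */
static int pm_disconnect(struct pm_client *c)
{
    if (c->open_windows > 0)
        return -1;
    c->connected = false;
    return 0;
}
```

With this shape, each `MPI_Init/MPI_Finalize` window would map to one `pm_window_begin`/`pm_window_end` pair, while `pm_disconnect` would be the natural candidate for the `atexit` handler.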
Reference: