Conversation

@hzhou (Contributor) commented May 29, 2025

Pull Request Description

PMIx launches a background thread to support nonblocking operations, e.g. PMIx_Fence_nb, and event notification. MPICH always wants to manage threads explicitly. Since we currently only use blocking operations, it should work without the background thread.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

PMIx launches a background thread to support nonblocking operations, e.g.
PMIx_Fence, and event notification. MPICH always wants to manage
threads explicitly. Since we currently only use blocking operations, it
should work without the background thread.
@hzhou force-pushed the 2505_pmix_prog branch from 4c28add to 2bbd224 on May 29, 2025 21:15
@hzhou (Contributor, Author) commented May 29, 2025

test:mpich/pmi

@hzhou hzhou requested a review from raffenet May 30, 2025 14:22
@raffenet (Contributor) left a comment

Didn't know about this attr. LGTM.

@hzhou (Contributor, Author) commented May 30, 2025

Labeled do-not-merge for now until we fully understand the impact.
Reference information from PMIx: openpmix/openpmix#3617

@rhc54 (Contributor) commented May 31, 2025

Hi folks. I followed the reference over here from the PMIx ticket - hope it's okay to continue discussion here? Just a few thoughts.

As I said over there, the use of the PMIx progress thread was something we (Argonne and Intel) had discussed pretty thoroughly back in the Aurora days. I recall flying over for personal meetings at least twice, plus several telecons where we went over the PMIx plans including in-depth discussion about the use/need for a progress thread. Experiments were conducted and both sides eventually agreed there was no measurable performance impact, and so we went ahead with it. Has something changed?

Shifting the progress thread from the PMIx library into the MPICH layer is certainly possible (though that path remains relatively unexercised), but it isn't clear to me what that really accomplishes. If you already have an internal progress thread, then it might make sense to at least try - if that's an MPICH option, then maybe you cover PMIx with that option? I'm unsure of the impact of that path.

I've given a fair amount of thought regarding optional support for operating without a progress thread. Of course, it's just software, so anything can be done - it's a question of effort. I'm not that familiar with your code base, but based on a glance through your latest release it would appear you don't utilize a very wide range of PMIx - mostly limited to the put-fence-get cycle?

If we were going to provide a "disable-progress-thread" option, we would naturally have to disable all the non-blocking APIs plus the event notification system. Easy enough to do that with a global flag so they just return "unavailable". Currently, the blocking form of an API is just a wrapper around the non-blocking form (so it holds you until the op completes), so we'd have to change that - a significant code refactoring effort, so limiting this to just a few APIs would help reduce the burden.
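As a rough illustration of the kind of guard being described here (the flag name and its placement are hypothetical, not actual OpenPMIx code), each nonblocking entry point could short-circuit behind the global flag:

```c
#include <pmix.h>
#include <stdbool.h>

/* Hypothetical flag, set during PMIx_Init when the progress thread
 * is disabled. Illustrative sketch only, not OpenPMIx code. */
static bool pmix_progress_thread_disabled = false;

pmix_status_t PMIx_Fence_nb(const pmix_proc_t procs[], size_t nprocs,
                            const pmix_info_t info[], size_t ninfo,
                            pmix_op_cbfunc_t cbfunc, void *cbdata)
{
    /* With no progress thread, nothing can drive the completion
     * callback, so the nonblocking form just reports "unavailable". */
    if (pmix_progress_thread_disabled)
        return PMIX_ERR_NOT_SUPPORTED;

    /* ... normal nonblocking implementation ... */
}
```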

We would then have to add a switch (based on the global flag) to have communication go directly to send/recv instead of being queued. Not that hard as the code uses a macro to invoke send/recv, so it's mainly a question of modifying the macro and providing an appropriate landing point for the non-queued message requests.
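A sketch of the kind of switch described above (the macro and function names are invented for illustration; the real OpenPMIx internals differ):

```c
/* Hypothetical send macro: with the progress thread running, the
 * message is queued as an event for that thread; without it, the
 * caller's thread performs the send directly. Names are invented. */
#define PMIX_MSG_SEND(peer, msg)                                      \
    do {                                                              \
        if (pmix_progress_thread_disabled) {                          \
            pmix_send_direct((peer), (msg));   /* caller's thread */  \
        } else {                                                      \
            pmix_send_enqueue((peer), (msg));  /* progress thread */  \
        }                                                             \
    } while (0)
```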

Finally, we'd have to add a global lock to prevent someone from simultaneously calling multiple blocking APIs. Thread safety in the library is based on everything being executed sequentially in the PMIx progress thread, and now everything would be executing in whatever thread(s) the caller is using - only real solution is to use the locks to force serialization of the calls.

Of course, all of that would only apply to the client side - current servers rely on PMIx being asynchronous, so we wouldn't want to break them. Unfortunately, since servers are allowed to call the same APIs, it means a fair amount of effort is required to separate out the code paths so we only impact the client code path of each API.

So it isn't impossible - but also not a trivial amount of effort. Definitely not something to do unless there really is an imperative driver for it. Do we have one? Are you folks willing/able to help implement it (especially given my quasi-retired status)?

Finally, just a FWIW: scanning the code underlying this PR, I noticed that you have a "fence" that collects data, and a "barrier" that does not collect data. Unfortunately, those are not quite implemented correctly. PMIx_Fence defaults to collecting data, so if you don't want it to do so you need to pass the PMIX_COLLECT_DATA attribute with a "false" value in your "barrier" function. You don't need to pass any attribute in your "fence" function to have "fence" collect data.
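In terms of the actual API, the correction described above would look roughly like this (`my_barrier`/`my_fence` are placeholder names for the PR's wrappers, and the default-collect behavior is as stated in this comment):

```c
#include <pmix.h>

/* Barrier: PMIx_Fence collects data by default, so suppress the
 * collection explicitly with PMIX_COLLECT_DATA = false. */
static pmix_status_t my_barrier(void)
{
    pmix_info_t info;
    bool collect = false;
    pmix_status_t rc;

    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    /* NULL procs / 0 nprocs means all processes in the namespace */
    rc = PMIx_Fence(NULL, 0, &info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}

/* Data-exchanging fence: collection is the default, so no
 * attribute is needed. */
static pmix_status_t my_fence(void)
{
    return PMIx_Fence(NULL, 0, NULL, 0);
}
```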

@hzhou (Contributor, Author) commented May 31, 2025

Hi Ralph, appreciate you reaching out to us. We should schedule discussions offline sometime.

Some context -

  • So far, MPICH only uses PMIx as an alternative to PMI-1. That's why we expect more-or-less similar semantics, especially on the client side.
  • We started exploring more PMIx features this year as we add deeper MPI Sessions support - driven by the malleability research from ParTec (PSMPI). cc @sonjahapp
  • We will not drop PMI-1. It is simple, stable, and more reliably supported by vendors, including on new systems. By the way, I was quite disappointed that OpenPMIx dropped the PMI-1 compatibility layer.
  • So the current effort is to extend PMI-1 with selected PMIx features to support what is needed for MPI Sessions. The end goal is to maintain dual support of PMIx and PMI-1 with feature parity.
  • Aurora currently uses PMIx, but from PALS (HPE). It is only a subset of the full PMIx. And I think this will be a common scenario - vendors picking and choosing a subset of PMIx rather than relying on OpenPMIx.
  • Coming from MPI, I am trying to understand the PMIx standard. I find it a bit odd that it is focused on a particular implementation (OpenPMIx) rather than being a standard that guides implementations.
  • MPICH currently supports PMIx by directly linking OpenPMIx. But OpenPMIx is entangled with features from a very different user base than MPI's. It is not ideal to have OpenPMIx as a dependency. For example, we just realized that OpenPMIx forces MPICH to use background threads.

Regarding the use of background threads, I am more concerned about the PMIx standard aspect than the OpenPMIx implementation. I think the latter is just an implementation choice that comes with its implementation-specific scope. However, if MPICH wants to implement its own PMIx, or PMIx-parity features, do we also have to support background threads in the client?

It is an important question because MPICH always explicitly manages progress and makes explicit choices about using threads. For context, the main MPI users still assume MPI_THREAD_SINGLE. If we start relying on background threads and allowing access from those threads (event callbacks), then a large stack of previous design assumptions will need to be examined and potentially redesigned. As you can see, it is not a light question.

And from our experience of managing progress without background threads, we don't see a reason to require background threads in the PMIx standard. Again, we fully accept the use of background threads in an implementation.

Do you have my email or Ken Raffenetti's email? If not, we'll reach out to you. We should continue the discussion offline.

@hzhou (Contributor, Author) commented May 31, 2025

Finally, just a FWIW: scanning the code underlying this PR, I noticed that you have a "fence" that collects data, and a "barrier" that does not collect data. Unfortunately, those are not quite implemented correctly. PMIx_Fence defaults to collecting data, so if you don't want it to do so you need to pass the PMIX_COLLECT_DATA attribute with a "false" value in your "barrier" function. You don't need to pass any attribute in your "fence" function to have "fence" collect data.

Thanks for checking our code and pointing out the issues!

@rhc54 (Contributor) commented May 31, 2025

I have Ken's email, but not yours. You are certainly welcome to contact me and set up a time to discuss it (Ken definitely has my email).

I obviously don't necessarily agree with some of your points, and I think you're probably missing some history, but that's for the discussion. FWIW: pretty much everyone uses the OpenPMIx library, though they may only support some of the server-side functions. ParTec has pretty much everything supported, Slurm is working on extending their server, HPE is working towards a more complete (perhaps fully implemented) server, etc.

So I think the server side is a done deal. Research is more on integrating to the system scheduler and dynamic systems, with MPI kinda following along behind (e.g., through the sessions and fault tolerance stuff). You will therefore see some activity in that area to support them as PMIx is the library they've chosen to use for that purpose.

Client side libraries are continuing to adopt (even the AI/ML folks are investigating it), MPICH being about the only one I know of that tends towards lesser support. No issue, frankly - I have no problem with your continued use of PMI-1. If it serves your needs, that's great!

The fact that PMIx has more features than you need shouldn't impact dependency - that is true of every library you link against (pretty rare for anyone to use every API in a library - heavens, nobody does that with MPI either!). It would be interesting to know why you believe the broader feature set has anything to do with your use of the library.

Hope to chat soon!
