MPIX_Stream #5908

hzhou · 2022-03-26T16:20:35Z

hzhou
Mar 26, 2022
Maintainer

`MPIX_Stream`

Background motivation

MPI is ambiguous on its multi-thread model.
- Most of the text implies a serial model --
  
  (with `MPI_THREAD_MULTIPLE) the outcome will be as if the calls were executed in some order.
  -- MPI report 4.0
  
  Yet, sporadically, the text also emphasizes its intention to encourage concurrency in MPI+thread.
  Concurrency in MPI happens in the same way as compiler optimization of a serial code, guided by outcomes.
  The default behavior of a typical implementation is to use global critical section, thus all MPI operations are serialized. Then complicated and smart implementations tries its magic to skip or yield the big lock whenever it can get away with it.
- This ambiguity destroys MPI+thread performance -- extra thread synchronization is always required even when it is okay to skip.
- However, the user almost always knows when it is okay to skip thread synchronization. Best multi-thread application always designed to avoid/minimize thread synchronization.
- User can't tell MPI because MPI does not have or refuse to acknowledge the thread concept -- "the outcome will be as if the calls executed in some order"
GPU application
- Prevalent GPU programming model is "stream enqueue" -> "queue/graph optimization" -> "stream synchronization"
- A GPU stream is again a serial execution context.
- Directly passing GPU stream into MPI is hacky and it is bound to confuse/conflict with the existing API, which does not promise a serial execution context.
Without MPI concept of "serial operation context", both application and implementations are complicated, convoluted, and under-performant
With the MPI concept and entity, both applications and implementations can be simple and direct, achieving ideal MPI+thread/gpu performance.

`MPIX_Stream`

First thing first, we need an MPIX_Stream - a serial operation context:
- MPIX_Stream_create(MPI_Info info, MPIX_Stream *stream)
  
  Even with MPI_INFO_NULL, an MPIX_Stream distinguishes independent serial operation context. It provides a facility for users to attach another runtime system's serial context to MPI for interoperability. For example, user can set info "type"="cudaStream_t", "id"=cuda_stream to have a cuda-aware implementation to inter-op with cuda applications.
- MPIX_STREAM_NULL represents the default stream all the traditional MPI APIs use.
  - On the default stream, "the outcome will be as if the calls executed in some order."
  - Implementation will try its best to preserve/promote concurrency within the boundary of MPI matching semantics.
- Passing an opaque stream variable via info hints require consistent encoding/decoding. We provide

MPIX_Info_set_hex(info, "id", &stream, sizeof(stream);

 to reliably pass a binary value into MPI via info hints.

As long as MPI is concerned, operations on different streams (including the default stream) are concurrent.
NOTE: MPIX_Stream is local regarding its serial context. However, in communication, pairing and matching between the origin stream and the target stream are often significant. Imagine when a stream accidentally receives a message destined to another stream, forcing an inter-stream synchronization that it's desirable to avoid.
The addition of MPIX_Stream essentially allows MPI_THREAD_SERIALIZED (on a stream) within MPI_THREAD_MULTIPLE globally.
Implementor note: If there isn't resource available (VCI, including no VCI support), return failure. If the specified type is not supported, return fail.

Stream Communicator

We use the stream to setup "Stream Communicators" -
- MPIX_Stream_comm_create(MPI_Comm parent_comm, MPIX_Stream stream, MPI_Comm *stream_comm);
  
  Each process can specify its own participating stream (or MPIX_STREAM_NULL).
  
  YES! This is essentially a revival of the endpoint proposal, without the complication of endpoints.
All traditional MPI operation works on stream communicators, with a strong serial context. It is illegal to issue concurrent operations on a stream communicator!
We support a set of "enqueue" operations with a stream communicator and on processes with non-default stream attached.
- MPIX_Send_enqueue
- MPIX_Recv_enqueue
- ...
Alternatively, if the attached stream is an enqueue-only stream, e.g. cudaStream_t, then we can make all MPI operations on this stream communicator (per local process) is an implicit enqueue operation. Effectively, MPIX_Send_enqueue is simply an alias to MPIX_Send, but more explicitly.

Stream Multiplex Communicator

What it can't achieve (but the old endpoints proposal does) is an "N-to-1" communication pattern (common in tasks/dispatch systems):
What we needed is a stream multiplex communicator (this is a bit hairy, but it is very specific, so hopefully some complexity is allowed to address a complicated scenario)--

MPIX_Stream_comm_create_multiplex(MPI_Comm parent_comm, int num_streams, MPIX_Stream streams[], MPI_Comm *multiplex_comm)

To use it --

MPIX_Stream_send(buf, dst_rank, tag, multiplx_comm, src_index, dst_index);
MPIX_Stream_recv(buf, src_rank, tag, multiplx_comm, src_index, dst_index);

In receive, src_index can be "any" (-1).

It's essentially an endpoints communicator, but address it with rank+index rather than an endpoint rank. Users do not need to maintain the rank/thread mapping.

Rationale and Restrictions

Hopefully, this is a series of (intuitive) generalizations --

Normal stream communicator can be thought of as a special multiplex communicator where src_index and dst_index are both 0.
Normal (conventional) communicator is a special case of stream communicator with default stream.

Restrictions on multiplex communicators:

The only way to communicate with additional streams in a multiplex communicator is via the above point-to-point send/recv.
When a multiplex communicator is used in other operations including collectives and window creation, only the first stream participates, essentially reducing to a normal stream communicator.

Restrictions on mixing "enqueue" and immediate operations on a stream

The Biggest Kick
With stream communicator, it's possible for an implementation to safely remove all internal locks, thus achieving ideal MPI+thread performance.

NOTE: prototype in PR #5906. In particular, example code: https://github.com/pmodels/mpich/blob/fa71d8970e8014765e398f8ae66ff763ef8d1ab3/test/mpi/impls/mpich/gpu/stream.cu

hzhou · 2022-03-26T16:47:34Z

hzhou
Mar 26, 2022
Maintainer Author

cc @jdinan @yfguo @raffenet @rsth

0 replies

Wee-Free-Scot · 2022-03-28T08:24:49Z

Wee-Free-Scot
Mar 28, 2022

This is an interesting contribution, thanks!

I see two forms of prior art that need to be carefully thought through and compared with this proposal. A brand new API has the advantage of complete freedom but also the disadvantage of adding a whole new API with new/different semantics that are in addition to the extant knowledge and expertise.

The first prior art is mentioned in the proposal itself: the endpoints proposal (and spin-offs like the finepoints proposal, which is now partitioned communication in MPI-4.0).

The second prior art is MPI Sessions (version 1 of which is now in MPI-4.0). This link is not immediately apparent, I guess.

The sessions proposal permits thread support level to be specified for each session and to be different for each session. The isolation of sessions is intended to mean that questions like "what happens if there are concurrent calls using different sessions?" are easily answered with "sessions are isolated from each other; there are no cross-session threading problems" (which might not actually work out in practice, but was/is the intent).

I think, therefore, that it is worth comparing your MPIX_STREAM with MPI_Session(support_level:=MPI_THREAD_SERIALIZED).

All procedure calls that use an MPI object derived from such as "serialised session" are guaranteed (by the applicable thread support level) to be serialised w.r.t. each other (but not, necessarily, w.r.t. other MPI calls that use MPI objects derived from other sessions). I think we already have the serial execution context desired for MPIX_STREAM.

The global thread support level (equivalently, the cross-session thread support level) is not defined in MPI-4.0 -- mostly intentionally, because it should not be necessary. Given the existing "outcome as if called in order" general proclamation for all MPI calls globally within the MPI process and the isolation of sessions, it should already be clear what it permitted, what is prohibited, and what the outcome should be for any execution pattern. This is a big assertion -- as we become aware of counter-examples to this, then we should strive to plug the holes in the current definitions and specification text. I think we already have the MPI_THREAD_MULTIPLE global execution context desired for MPIX_STREAM.

We should look very carefully at the difference between the programming model context and the execution model context. The mapping from one to the other (by compilers) generally depends on visible side-effects, i.e. outcomes (as mentioned above). Re-ordering (by the compiler or OS/execution runtime) of the execution of MPI procedure calls that use objects derived from the same MPI session should be disallowed. Interweaving of these sequential "streams" of serialised procedure calls is permitted and encouraged if it improves performance. We would need to look carefully at fairness: drain session1 commands, then look at session2 commands might not be a good implementation choice, even if it is a legal choice.

The multiplexing to emulate N-to-1 endpoints suggests an MPI session using MPI_THREAD_MULTIPLE as its thread support level -- the scope of its thread level is still limited to MPI procedure calls using an MPI object derived from the MPI session but now it must accept concurrent calls (from any thread/execution context) using those MPI objects. This requires the implementation to multiplex for that MPI session, even though it is not required to multiplex for other MPI sessions (if they have a lower thread support level).

The isolation of MPI sessions seems like the weakest point of this whole edifice. Are they isolated enough to achieve the goal?

6 replies

hzhou Mar 28, 2022
Maintainer Author

@Wee-Free-Scot We obviously have not spent as much thought on sessions. It's good you bring it up! I have a couple quick points comparing sessions and streams now.

While both a stream and a session are local, a communicator is not. A stream is contained in a communicator, thus it easily allows one process to communicate from a serialized stream to another process's default "thread-multiple" stream. In contrast, the session contains communicators -- and I believe, nearly every other MPI object, right? Can we communicate from one process's serial context to another process's concurrent context using sessions? The N-to-1 example is not the same as MPI_THREAD_MULTIPLE. Each MPI_Stream_send/MPI_Stream_recv is a very specific "thread-serial" (unless MPIX_STREAM_NULL is used).

My second point is specificity and ease of use. This may be a secondary consideration from MPI's point of view, but I argue it is probably the top consideration from a user's point of view. A stream is very specific, it is just a serial execution context. Its goal is to capture the same ideas that are already being used by users outside MPI -- thread (kernel thread or user thread), CUDA stream, etc. It tries to provide the same idea but is well defined within MPI. A user who is already using e.g. threads or CUDA does not need to learn a new idea. They will naturally assume an MPI stream represents their thread or CUDA stream and once mapped, use it the same way and accept the same consequences. At least this will be the goal behind the MPI stream proposal. On the other hand, a session is complicated to understand. Users may naturally take an MPI Session as a separate programming context, mapping to a library. But quickly they'll discover that session allows them to mix usage even in the same function to achieve certain outcomes. Honestly, even as implementers, we still struggle to grasp the understanding of sessions. Thus, I do not expect a session is easy for users to adopt even when it can meet their goal.

hzhou Mar 28, 2022
Maintainer Author

The global thread support level (equivalently, the cross-session thread support level) is not defined in MPI-4.0 -- mostly intentionally, because it should not be necessary. Given the existing "outcome as if called in order" general proclamation for all MPI calls globally within the MPI process and the isolation of sessions, it should already be clear what it permitted, what is prohibited, and what the outcome should be for any execution pattern. This is a big assertion -- as we become aware of counter-examples to this, then we should strive to plug the holes in the current definitions and specification text. I think we already have the MPI_THREAD_MULTIPLE global execution context desired for MPIX_STREAM.

Agreed. Just to note that MPIX_STREAM_NULL is meant to cover the backward compatibility rather than a new feature. When the traditional model of MPI_COMM_WORLD + MPI_THREAD_MULTIPLE works for your application (even for part of your application), keep using it.

hzhou Mar 28, 2022
Maintainer Author

We should look very carefully at the difference between the programming model context and the execution model context. The mapping from one to the other (by compilers) generally depends on visible side-effects, i.e. outcomes (as mentioned above).

Mapping by outcomes is a good way to describe the current approach of MPI_THREAD_MULTIPLE. Just like compiler optimizations, to user, this is magic. To implementors, this is messy art. "Magic" is difficult to use, often does not work. "Art" is difficult to perfect and repeat.

MPIX stream is mapping by explicit intention/declaration.

hzhou Mar 28, 2022
Maintainer Author

The isolation of MPI sessions seems like the weakest point of this whole edifice. Are they isolated enough to achieve the goal?

During our research, we raised this question multiple times -- since we were never able to beat "MPI everywhere" (using MPI_THREAD_SINGLE with multiple processes on each node), shouldn't the conclusion be "not to use MPI+thread"? With the ease of shared memory and inter-process kernel enhancements, "MPI everywhere" might just be a better model.

But the reality is, MPI will miss the wagon before we convince the users. It is not about whether MPI's process model can meet user's goal. It is about how MPI can meet the multi-thread model that users are already using.

A virtual process (with its isolation) is not the answer. Users choose multiple threads explicit desires the lack of isolation, or the convenience because of it. So even when MPI sessions can meet users' goal of expressing their thread context, if users have to go through multiple sets of bootstrapping and bookkeeping to keep track of thread/session/communicator mapping, they are likely not to use it or even learn it in the first place.

hzhou Mar 28, 2022
Maintainer Author

Could this also more explicitly bridge the camp A/camp B issue? Could the standard, for example, state that THREAD_MULTIPLE does have a global order inside a session/stream which MPI must observe/maintain and I suspect most programmers expect (and coincidentally all current implementations provide IIRC), while sessions/streams provides the concurrency isolation that camp B wants for very high performance contention-free parallel messaging?

@patrickb314 I think so (in some way, not directly). We have camp A/camp B debate because we can't have the cake and eat it too. Once we provide additional cakes (e.g. MPIX stream), I think the original question that divides camp A and camp B become less important.
For example, an "all-serializing" MPI_THREAD_MULTIPLE might be acceptable.

Wee-Free-Scot · 2022-03-28T15:27:02Z

Wee-Free-Scot
Mar 28, 2022

@hzhou these are good questions/comments.

First, the local session here has no direct relationship with any remote session at some other MPI process. As you say, it is a local object. My session here might request and be provided the MPI_THREAD_MULTIPLE thread support level, whereas your session over there might request and be provided the MPI_THREAD_FUNNELLED thread support level. I can use any thread with a communicator derived from my local session to send a message to you. You are required to use a particular thread to receive those messages because your communicator was derived from your local session. We have an N-to-1 thread context marshalling/multiplexing situation. Does this address your comment or clarify anything for you?

Second, there are (at least) two separate existing conceptual interpretations/scopes for any serial context that some make sense here. I think MPIX_STREAM is proposing a third scope, which is not ideal if one of the existing scopes is sufficient.

the equivalent of specifying MPI_THREAD_SINGLE -- every action is serialised because there is only one thread. The one thread can interact with any MPI object, but it must only interact with one mutable MPI object at a time (broadly, except for APIs that take more than one MPI object as input, which fit the pattern of one output/mutable object and one or more input/read-only objects). The serial context is clearly bigger than any one MPI object, but applies its serialisation constraint across all in-scope objects.
the sub-contexts emerging from the matching rules, which guarantee that some MPI operations cannot have any visible semantic effect on some other MPI operations and can therefore be parallelised without serialisation between the two groups of operations (even if ordering must be maintained within each grouping, the two groups can be interleaved in any way the implementation chooses). Sometimes this sub-context is smaller than an MPI object (a communicator, for example) -- such as the case when using different explicit tags at the receiver (helped along with the mpi_no_any_tag info key). Sometimes this sub-context is bigger than an MPI object -- such as when using different communicators to achieve parallelism (meaning to achieve a situation where one set/stream of messages are independently ordered from another set/stream of messages). The rules for this kind of parallelisable sub-context are complex and depend on the matching/ordering rules, which are ambiguous.
the MPIX_STREAM approach, which seems to be explicitly limited to within the scope of a single MPI object (c.f. "A stream is contained in a communicator") -- within one MPIX_STREAM, it is not permitted to issue a call to MPI_Send using one communicator followed by a call to MPI_Recv on another communicator because that second call must be issued in the context of a different MPIX_STREAM because it uses a different MPI object. This seems weird to me -- I have probably mis-interpreted your comments.

Your second point is aligned with good design principles, specifically things "should do one thing and do it well". Sessions is a bit of a monster, a bandwagon that has accreted all kinds of additional baggage during its journey and shows no sign of stopping. On the other hand, I can see both a super-context (bigger than or equal to a single MPI object) and a sub-context (smaller than or equal to a single MPI object) scope for serialisation/ordering/isolation. Endpoints were a super-context scope; thread support levels are super-context modifiers; sessions are super-context scope; MPIX_STREAM seems to be sub-context scope, which is otherwise missing. Is such a scope useful and/or essential for users and/or implementors of MPI?

1 reply

hzhou Mar 28, 2022
Maintainer Author

I will likely review your comments multiple times and may evolve my thought and reply multiple times. Here are a few quick notes.

I agree that one can use the session as if it is a local stream -- I got to this after I posted my question and before your reply. It may be weird for such session usage since one needs to call MPI_Session_init multiple times and each goes through a communicator bootstrapping process.
Clarify on N-to-1 communication, this is what we try to avoid:
```
  for i = 0 : comms.size():
      MPI_Irecv(buf[i], ..., MPI_ANY_SRC, comms[i], reqs[I])
  while listening:
      for i = 0 : comms.size():
           MPI_Test(reqs[I], ...)
```
~~Can we even create multiple communicators from a single local session but connected to different sessions in another process?~~(We can create multiple MPI_THREAD_SERIALIZED sessions, each create a comm, and use them in a single polling loop). Then --
For one, there is this burden of extra communicator maintenance. For two, multiple progress polls in each loop. I am not sure if we can use MPI_Testany on requests from multiple sessions since different sessions meant to be isolated. If it is allowed, there are potential optimizations. But even then, the message ordering on different comms is missing.
The serial context is clearly bigger than any one MPI object, but applies its serialization constraint across all in-scope objects.

I think this is a trap (that a bigger scope) that the MPI community falls into. The serial context is an orthogonal concept to concepts behind other MPI objects. A thread typically does not contain objects and objects can typically be accessed across threads. Similarly for CUDA streams. This is notably different from a "process". I think this is the core reason why the MPI thread-level concept failed to address the need for MPI+thread.
I mis-commented in "a stream is contained in a communicator". A stream is attached to communicators, not contained. A stream can be attached to multiple communicators.
MPIX_STREAM seems to be sub-context scope, which is otherwise missing. Is such a scope useful and/or essential for users and/or implementors of MPI?

As earlier comments, I am not sure "sub-context scope" is the right interpretation.
I certainly think this missing concept is useful and essential, for both users and implementors. We already use threads and streams and it is useful to tell MPI and map between users and implementations. It is essential as it is currently missing, at least in concept.

yfguo · 2022-04-07T20:14:27Z

yfguo
Apr 7, 2022
Maintainer

Just adding notes from the discussion with Jim:

What happens if the user knowingly or unknowingly dup the stream comm?

1 reply

hzhou Apr 7, 2022
Maintainer Author

If they dup a stream comm, they get a stream comm. We can help users to avoid confusion by making stream comm an explicit "flavor" -- similar to "intra-comm" and "inter-comm", although the stream flavor will be an add-on flavor, i.e. there will be "stream intra-comm" and "stream inter-comm".

hzhou · 2022-04-08T02:01:34Z

hzhou
Apr 8, 2022
Maintainer Author

A big part of the discussion is how do we support CUDA graph or whether MPI should support CUDA graph. Semantically, to support CUDA graph, we need a function something like:

    MPIX_Send_task_create(..., task_type, &cuda_graph);

Unlike a stream, which to a certain abstraction, is a serial execution context; a CUDA graph is in semantic a task to be inserted into a stream or as a node to be inserted into an execution graph. Thus, if we remove the specific reference to CUDA, it can be generalized as a task.

The issue for task is it is tied to the task runtime. A CUDA graph will only have meaning to a CUDA runtime. An OpenMP task will only have meaning with OpenMP. The creation of the CUDA graph has to be via CUDA API. Similarly an OpenMP task. I am phrasing this as such to raise the question that do we really want to handle the task creation inside MPI or can we just leave it to user or to a task runtime library that sits on top of MPI?

@jdinan From your experience of building a prototype on top of MPI, what's the advantage to move the CUDA graph creation inside an MPI implementation versus do it in a library above MPI?

0 replies

hzhou · 2022-04-08T02:15:32Z

hzhou
Apr 8, 2022
Maintainer Author

Second thought on MPIX_Send_enqueue --

We'll need one such "enqueue" function for every operation current MPI has, that shall give every one a pause. One alternative is to let go of the generalization of enqueue concept and make it implicit. That is, if a user ties a CUDA stream to an MPIX_Stream, then let's require/assume every operation on such stream is an enqueue. That will allow us to provide users with all these enqueue functionality without adding additional APIs. We also don't need to provide MPIX_Stream_synchronize.

0 replies

hzhou · 2022-04-19T00:53:04Z

hzhou
Apr 19, 2022
Maintainer Author

There are discussions on whether we need to tie the GPU (e.g. CUDA) stream to MPIX_Stream. What is suggested is direct interfaces as --

MPI_Send_enqueue(buf, ..., comm, enum stream_type, void *stream_value);

It is not so bad if all we needed is two explicit parameters. It makes sense from CUDA perspective, But consider the communication perspective --

We'd like to isolate GPU stream communication traffic from normal non-stream traffic. Normal traffic may involve GPU operations during progress. Without isolation, we pose restrictions and limitations on implementation.
NOTE: this is the same consideration that CUDA disallow cudaLaunchHostFunc to call cuda API.
We can implicitly allocate an internal stream for GPU stream traffic. But then we are forced to only use a single internal stream for all GPU stream communications. This is a global serialization that is undesirable.
If we use more than one internal stream, then we will need information on how to route traffic from one stream to a specific remote stream. This will force us to specify both local CUDA stream and remote CUDA streams per operation. The complexity can grow very quickly from here (think collectives).

0 replies

hzhou · 2022-04-22T03:33:53Z

hzhou
Apr 22, 2022
Maintainer Author

What is the behavior on MPI_Comm_dup from a stream communicator? I feel like we can go either way -- inherit as a stream comm or strip the stream info and reduce it to a normal comm. Stripping the stream info is definitely much simpler for implementations. One of the case for inheriting the streams may be: user can create a stream comm and pass the comm to a library. The library may not be aware of the stream allocation and simply treat it as MPI_THREAD_SINGLE. It will work if the dup retains the stream information.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MPIX_Stream #5908

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 9 comments 8 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

MPIX_Stream #5908

hzhou Mar 26, 2022 Maintainer

MPIX_Stream

Background motivation

MPIX_Stream

Stream Communicator

Stream Multiplex Communicator

Rationale and Restrictions

Replies: 9 comments · 8 replies

hzhou Mar 26, 2022 Maintainer Author

Wee-Free-Scot Mar 28, 2022

hzhou Mar 28, 2022 Maintainer Author

hzhou Mar 28, 2022 Maintainer Author

hzhou Mar 28, 2022 Maintainer Author

hzhou Mar 28, 2022 Maintainer Author

hzhou Mar 28, 2022 Maintainer Author

Wee-Free-Scot Mar 28, 2022

hzhou Mar 28, 2022 Maintainer Author

yfguo Apr 7, 2022 Maintainer

hzhou Apr 7, 2022 Maintainer Author

hzhou Apr 8, 2022 Maintainer Author

hzhou Apr 8, 2022 Maintainer Author

hzhou Apr 19, 2022 Maintainer Author

hzhou Apr 22, 2022 Maintainer Author

hzhou
Mar 26, 2022
Maintainer

`MPIX_Stream`

`MPIX_Stream`

Replies: 9 comments 8 replies

hzhou
Mar 26, 2022
Maintainer Author

Wee-Free-Scot
Mar 28, 2022

hzhou Mar 28, 2022
Maintainer Author

hzhou Mar 28, 2022
Maintainer Author

hzhou Mar 28, 2022
Maintainer Author

hzhou Mar 28, 2022
Maintainer Author

hzhou Mar 28, 2022
Maintainer Author

Wee-Free-Scot
Mar 28, 2022

hzhou Mar 28, 2022
Maintainer Author

yfguo
Apr 7, 2022
Maintainer

hzhou Apr 7, 2022
Maintainer Author

hzhou
Apr 8, 2022
Maintainer Author

hzhou
Apr 8, 2022
Maintainer Author

hzhou
Apr 19, 2022
Maintainer Author

hzhou
Apr 22, 2022
Maintainer Author