Replies: 9 comments 8 replies
-
This is an interesting contribution, thanks! I see two forms of prior art that need to be carefully thought through and compared with this proposal. A brand new API has the advantage of complete freedom, but also the disadvantage of adding a whole new API with new/different semantics on top of the extant knowledge and expertise.

The first prior art is mentioned in the proposal itself: the endpoints proposal (and spin-offs like the finepoints proposal, which is now partitioned communication in MPI-4.0). The second prior art is MPI Sessions (version 1 of which is now in MPI-4.0). This link is not immediately apparent, I guess. The sessions proposal permits the thread support level to be specified for each session and to be different for each session. The isolation of sessions is intended to mean that questions like "what happens if there are concurrent calls using different sessions?" are easily answered with "sessions are isolated from each other; there are no cross-session threading problems" (which might not actually work out in practice, but was/is the intent).

I think, therefore, that it is worth comparing your MPIX_STREAM with MPI_Session(support_level:=MPI_THREAD_SERIALIZED). All procedure calls that use an MPI object derived from such a "serialised session" are guaranteed (by the applicable thread support level) to be serialised w.r.t. each other (but not, necessarily, w.r.t. other MPI calls that use MPI objects derived from other sessions). I think we already have the serial execution context desired for MPIX_STREAM.

The global thread support level (equivalently, the cross-session thread support level) is not defined in MPI-4.0 -- mostly intentionally, because it should not be necessary. Given the existing "outcome as if called in order" general proclamation for all MPI calls globally within the MPI process, and the isolation of sessions, it should already be clear what is permitted, what is prohibited, and what the outcome should be for any execution pattern. This is a big assertion -- as we become aware of counter-examples, we should strive to plug the holes in the current definitions and specification text. I think we already have the MPI_THREAD_MULTIPLE global execution context desired for MPIX_STREAM.

We should look very carefully at the difference between the programming model context and the execution model context. The mapping from one to the other (by compilers) generally depends on visible side-effects, i.e. outcomes (as mentioned above). Re-ordering (by the compiler or OS/execution runtime) of the execution of MPI procedure calls that use objects derived from the same MPI session should be disallowed. Interleaving of these sequential "streams" of serialised procedure calls is permitted and encouraged if it improves performance. We would need to look carefully at fairness: "drain session1 commands, then look at session2 commands" might not be a good implementation choice, even if it is a legal one.

The multiplexing to emulate N-to-1 endpoints suggests an MPI session using MPI_THREAD_MULTIPLE as its thread support level -- the scope of its thread level is still limited to MPI procedure calls using an MPI object derived from that MPI session, but now it must accept concurrent calls (from any thread/execution context) using those MPI objects. This requires the implementation to multiplex for that MPI session, even though it is not required to multiplex for other MPI sessions (if they have a lower thread support level).
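For concreteness, a minimal sketch of the comparison point described above: an MPI-4.0 session requesting MPI_THREAD_SERIALIZED via the thread_level info key, and a communicator derived from it whose procedure calls are then serialised with respect to each other. The pset name and tag string are just generic examples, not anything tied to this proposal.

```c
#include <mpi.h>

/* Create a communicator derived from a session that requested
 * MPI_THREAD_SERIALIZED: calls using objects derived from this session are
 * serialised w.r.t. each other, but not w.r.t. objects from other sessions. */
static void create_serialized_comm(MPI_Session *session_out, MPI_Comm *comm_out)
{
    MPI_Info info;
    MPI_Group group;

    MPI_Info_create(&info);
    MPI_Info_set(info, "thread_level", "MPI_THREAD_SERIALIZED");
    MPI_Session_init(info, MPI_ERRORS_ARE_FATAL, session_out);
    MPI_Info_free(&info);

    MPI_Group_from_session_pset(*session_out, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "org.example.serialized-comm",
                               MPI_INFO_NULL, MPI_ERRORS_ARE_FATAL, comm_out);
    MPI_Group_free(&group);
}
```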
The isolation of MPI sessions seems like the weakest point of this whole edifice. Are they isolated enough to achieve the goal?
-
@hzhou these are good questions/comments.

First, the local session here has no direct relationship with any remote session at some other MPI process. As you say, it is a local object. My session here might request and be provided the MPI_THREAD_MULTIPLE thread support level, whereas your session over there might request and be provided the MPI_THREAD_FUNNELED thread support level. I can use any thread with a communicator derived from my local session to send a message to you. You are required to use a particular thread to receive those messages, because your communicator was derived from your local session. We have an N-to-1 thread context marshalling/multiplexing situation. Does this address your comment or clarify anything for you?

Second, there are (at least) two separate existing conceptual interpretations/scopes for any serial context that make some sense here. I think MPIX_STREAM is proposing a third scope, which is not ideal if one of the existing scopes is sufficient.
Your second point is aligned with good design principles, specifically "things should do one thing and do it well". Sessions is a bit of a monster, a bandwagon that has accreted all kinds of additional baggage during its journey and shows no sign of stopping. On the other hand, I can see both a super-context (bigger than or equal to a single MPI object) and a sub-context (smaller than or equal to a single MPI object) scope for serialisation/ordering/isolation. Endpoints were a super-context scope; thread support levels are super-context modifiers; sessions are a super-context scope; MPIX_STREAM seems to be a sub-context scope, which is otherwise missing. Is such a scope useful and/or essential for users and/or implementors of MPI?
-
Just adding notes from the discussion with Jim:
-
A big part of the discussion is how do we support CUDA graph or whether MPI should support CUDA graph. Semantically, to support CUDA graph, we need a function something like:
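The snippet that originally followed the colon is not preserved in this thread. As a purely hypothetical stand-in (MPIX_Send_graph_node and its parameters are invented for illustration, not proposed API), the idea would be an operation that is recorded as a node of a CUDA graph instead of being enqueued onto a stream:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical illustration only -- not an existing or proposed MPI function.
 * Instead of enqueuing onto a stream, the send is added as a node of a CUDA
 * graph, with explicit dependencies, so the graph owns the ordering. */
int MPIX_Send_graph_node(const void *buf, int count, MPI_Datatype datatype,
                         int dest, int tag, MPI_Comm comm,
                         cudaGraph_t graph,              /* graph being built    */
                         const cudaGraphNode_t *deps,    /* node dependencies    */
                         int ndeps,
                         cudaGraphNode_t *node_out);     /* resulting graph node */
```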
Unlike a stream, which at a certain level of abstraction is a serial execution context, a CUDA graph is semantically a task to be inserted into a stream, or a node to be inserted into an execution graph. Thus, if we remove the specific reference to CUDA, it can be generalized as a task. The issue with tasks is that they are tied to a task runtime: a CUDA graph only has meaning to the CUDA runtime, and an OpenMP task only has meaning to OpenMP. The creation of the CUDA graph has to be via the CUDA API, and similarly for an OpenMP task. I am phrasing it this way to raise the question: do we really want to handle task creation inside MPI, or can we just leave it to the user or to a task runtime library that sits on top of MPI? @jdinan From your experience of building a prototype on top of MPI, what is the advantage of moving the CUDA graph creation inside an MPI implementation versus doing it in a library above MPI?
-
Second thought on "We'll need one such 'enqueue' function for every operation current MPI has" -- that should give everyone pause. One alternative is to let go of the generalization of the enqueue concept and make it implicit. That is, if a user ties a CUDA stream to an MPIX_Stream, the regular operations on the derived stream communicator can be treated as implicit enqueue operations.
-
There are discussions on whether we need to tie the GPU (e.g. CUDA) stream to an MPIX_Stream, or instead pass it via explicit parameters.
It would not be so bad if all we needed were two explicit parameters; that makes sense from the CUDA perspective. But consider the communication perspective --
-
What is the behavior on
-
MPIX_Stream
Background motivation
- MPI is ambiguous about its multi-threading model. Most of the text implies a serial model, yet, sporadically, the text also emphasizes its intention to encourage concurrency in MPI+thread.
- Concurrency in MPI happens in the same way as compiler optimization of serial code: it is guided by outcomes.
- The default behavior of a typical implementation is to use a global critical section, so all MPI operations are serialized. More sophisticated implementations then try their magic to skip or yield the big lock whenever they can get away with it.
- This ambiguity destroys MPI+thread performance -- extra thread synchronization is always required, even when it would be okay to skip it.
- However, the user almost always knows when it is okay to skip thread synchronization; the best multi-threaded applications are designed to avoid or minimize thread synchronization.
- The user cannot tell MPI, because MPI does not have (or refuses to acknowledge) the thread concept -- "the outcome will be as if the calls executed in some order".
GPU application
- The prevalent GPU programming model is "stream enqueue" -> "queue/graph optimization" -> "stream synchronization".
- A GPU stream is, again, a serial execution context.
- Directly passing a GPU stream into MPI is hacky and bound to confuse/conflict with the existing API, which does not promise a serial execution context.
- Without an MPI concept of a "serial operation context", both applications and implementations are complicated, convoluted, and under-performant.
- With such an MPI concept and entity, both applications and implementations can be simple and direct, achieving ideal MPI+thread/gpu performance.

MPIX_Stream
MPIX_Stream - a serial operation context:

MPIX_Stream_create(MPI_Info info, MPIX_Stream *stream)

Even with MPI_INFO_NULL, an MPIX_Stream distinguishes an independent serial operation context. It also provides a facility for users to attach another runtime system's serial context to MPI for interoperability. For example, a user can set info "type"="cudaStream_t", "id"=cuda_stream to have a CUDA-aware implementation interoperate with CUDA applications. MPIX_STREAM_NULL represents the default stream that all the traditional MPI APIs use.

Passing an opaque stream variable via info hints requires consistent encoding/decoding; we provide MPIX_Info_set_hex for this purpose.

As far as MPI is concerned, operations on different streams (including the default stream) are concurrent.

NOTE: an MPIX_Stream is local with regard to its serial context. However, in communication, pairing and matching between the origin stream and the target stream are often significant. Imagine a stream accidentally receiving a message destined for another stream, forcing an inter-stream synchronization that we would like to avoid.

The addition of MPIX_Stream essentially allows MPI_THREAD_SERIALIZED (on a stream) within MPI_THREAD_MULTIPLE globally.

Implementor note: if there are no resources available (e.g. no VCI left, or no VCI support at all), return failure. If the specified type is not supported, also return failure.
Stream Communicator

We use the stream to set up "stream communicators":

MPIX_Stream_comm_create(MPI_Comm parent_comm, MPIX_Stream stream, MPI_Comm *stream_comm);

Each process can specify its own participating stream (or MPIX_STREAM_NULL). YES! This is essentially a revival of the endpoints proposal, without the complication of endpoints.

All traditional MPI operations work on stream communicators, with a strong serial context. It is illegal to issue concurrent operations on a stream communicator!

We support a set of "enqueue" operations with a stream communicator, on processes with a non-default stream attached:

MPIX_Send_enqueue
MPIX_Recv_enqueue

Alternatively, if the attached stream is an enqueue-only stream, e.g. cudaStream_t, then we can make every MPI operation on this stream communicator (per local process) an implicit enqueue operation. Effectively, MPIX_Send_enqueue is then simply an alias of MPI_Send, but more explicit. See the sketch below.
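A minimal usage sketch under the same assumptions: rank 0 attaches a CUDA-backed MPIX_Stream and enqueues a send behind a kernel, while rank 1 participates with MPIX_STREAM_NULL and posts a plain blocking receive. The enqueue signature is assumed to mirror MPI_Send; fill_kernel and the buffers are application placeholders.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only (error checking omitted): rank 0 enqueues a send behind a kernel
 * on its CUDA-backed stream; rank 1 joins with MPIX_STREAM_NULL and does a
 * traditional blocking receive. fill_kernel is an application placeholder. */
__global__ void fill_kernel(int *buf, int count);

static void ping_with_stream_comm(int rank, cudaStream_t cuda_stream,
                                  int *d_buf, int *h_buf, int count)
{
    MPI_Comm stream_comm;
    MPIX_Stream stream = MPIX_STREAM_NULL;

    if (rank == 0)
        create_cuda_backed_stream(cuda_stream, &stream);   /* from the sketch above */

    /* Each process attaches its own stream (or MPIX_STREAM_NULL). */
    MPIX_Stream_comm_create(MPI_COMM_WORLD, stream, &stream_comm);

    if (rank == 0) {
        fill_kernel<<<(count + 255) / 256, 256, 0, cuda_stream>>>(d_buf, count);
        /* Enqueued on cuda_stream, so it executes after the kernel completes. */
        MPIX_Send_enqueue(d_buf, count, MPI_INT, /*dest=*/1, /*tag=*/0, stream_comm);
        cudaStreamSynchronize(cuda_stream);
    } else if (rank == 1) {
        MPI_Recv(h_buf, count, MPI_INT, /*source=*/0, /*tag=*/0, stream_comm,
                 MPI_STATUS_IGNORE);
    }

    MPI_Comm_free(&stream_comm);
}
```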
Stream Multiplex Communicator

What a stream communicator can't achieve (but the old endpoints proposal does) is an "N-to-1" communication pattern (common in task/dispatch systems). What we need is a stream multiplex communicator (this is a bit hairy, but it is very specific, so hopefully some complexity is allowed to address a complicated scenario); see the sketch below for how it might be used.

In a receive, src_index can be "any" (-1). It is essentially an endpoints communicator, but addressed with rank+index rather than an endpoint rank; users do not need to maintain the rank/thread mapping.
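A sketch of the N-to-1 pattern. The creation call and the indexed send/receive (MPIX_Stream_comm_create_multiplex, MPIX_Stream_send, MPIX_Stream_recv) and their argument order are modeled loosely on the MPICH prototype and should be read as illustrative rather than settled API; data and result are application buffers. Several worker streams on rank 0 send to a single stream on rank 1, which receives with src_index = -1 ("any").

```c
#include <mpi.h>

#define NWORKERS 4

/* Sketch only: each process contributes an array of local streams; messages
 * are addressed by (rank, stream index). Names/signatures are assumed. */
static void n_to_1_dispatch(int rank, int data[NWORKERS], int result[NWORKERS])
{
    MPIX_Stream streams[NWORKERS];
    MPI_Comm mux_comm;
    int nstreams = (rank == 0) ? NWORKERS : 1;

    for (int i = 0; i < nstreams; i++)
        MPIX_Stream_create(MPI_INFO_NULL, &streams[i]);

    MPIX_Stream_comm_create_multiplex(MPI_COMM_WORLD, nstreams, streams, &mux_comm);

    if (rank == 0) {
        /* N independent producers: one per local stream index. */
        #pragma omp parallel for
        for (int i = 0; i < NWORKERS; i++)
            MPIX_Stream_send(&data[i], 1, MPI_INT, /*dest=*/1, /*tag=*/0, mux_comm,
                             /*src_index=*/i, /*dst_index=*/0);
    } else if (rank == 1) {
        /* One consumer: src_index = -1 means "any" of the sender's indices. */
        for (int i = 0; i < NWORKERS; i++)
            MPIX_Stream_recv(&result[i], 1, MPI_INT, /*source=*/0, /*tag=*/0,
                             mux_comm, /*src_index=*/-1, /*dst_index=*/0,
                             MPI_STATUS_IGNORE);
    }

    MPI_Comm_free(&mux_comm);
}
```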
Rationale and Restrictions

Hopefully, this is a series of (intuitive) generalizations -- e.g. on a plain stream communicator, src_index and dst_index are both 0.
.Restrictions on multiplex communicators:
Restrictions on mixing "enqueue" and immediate operations on a stream
The Biggest Kick
With stream communicators, it is possible for an implementation to safely remove all internal locks, thus achieving ideal MPI+thread performance.

NOTE: prototype in PR #5906. In particular, example code: https://github.com/pmodels/mpich/blob/fa71d8970e8014765e398f8ae66ff763ef8d1ab3/test/mpi/impls/mpich/gpu/stream.cu