-
cc @jczhang07
-
1.1. The MPIX_Stream lives before and after the stream communicators that use it.
2. Each MPIX_Stream is associated with a distinct internal network endpoint to isolate communication resources. The network endpoint needs to be initialized to establish connections; it is this connection-establishment step that prevents dynamic allocation.
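A minimal ordering sketch of what this implies, assuming a valid `cudaStream_t cuda_stream` already exists (the `type`/`value` info keys follow the MPICH CUDA examples; the exact teardown order is my reading, not gospel):

```c
MPIX_Stream stream;
MPI_Comm    scomm;
MPI_Info    info;

// bind the MPIX_Stream to the existing CUDA stream via info hints
MPI_Info_create(&info);
MPI_Info_set(info, "type", "cudaStream_t");
MPIX_Info_set_hex(info, "value", &cuda_stream, sizeof(cuda_stream));
MPIX_Stream_create(info, &stream);
MPI_Info_free(&info);

MPIX_Stream_comm_create(MPI_COMM_WORLD, stream, &scomm);
// ... MPIX_*_enqueue() operations on scomm ...

// teardown mirrors creation: free comms first, then the stream, then the CUDA stream
MPI_Comm_free(&scomm);
MPIX_Stream_free(&stream);
cudaStreamDestroy(cuda_stream);
```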
-
To be precise, the
It's not so much necessary as it is a convenience :). Consider the following use-case:

```c
PetscErrorCode VecDoSomethingCollective(Vec v, ...)
{
  MPI_Comm    comm;
  MPI_Comm    scomm;
  MPIX_Stream mpi_stream;

  // every PETSc object (e.g. Vec) holds a corresponding communicator
  PetscObjectGetComm(v, &comm);
  PetscGetMPIXStream(cuda_stream, &mpi_stream);
  MPIX_Stream_comm_create(comm, mpi_stream, &scomm);
  MPIX_Allreduce_enqueue(..., scomm);
  // Performant code should probably now cache scomm somewhere, and likely needs
  // reference counting to ensure it is properly managed. This adds more complexity.
}
```

We already have a

```c
PetscErrorCode VecNorm_CUDA(Vec v, ..., PetscScalar *norm)
{
  MPI_Comm    comm;
  MPIX_Stream mpi_stream;

  PetscObjectGetComm(v, &comm);
  PetscGetMPIXStream(cuda_stream, &mpi_stream);
  MPIX_Comm_set_stream(comm, mpi_stream);
  MPIX_Allreduce_enqueue(..., comm);
  // no need to deal with an extra comm object!
}
```
Can you explain this limitation more directly? I am not at all familiar with networking.
-
Also, is there an invalid

```cpp
struct PetscMPIXStream
{
  bool        init{};
  MPIX_Stream stream{};
};

void foo(PetscMPIXStream &strm)
{
  if (!strm.init) {
    MPIX_Stream_create(..., &strm.stream);
    strm.init = true;
  }
  // use strm.stream...
}
```

Ideally we would want something like

```cpp
void foo(MPIX_Stream &strm)
{
  if (strm == MPIX_STREAM_INVALID) { // for example
    MPIX_Stream_create(..., &strm);
  }
  // use strm...
}
```
-
PETSc currently caches an internal duplicated comm, and I think it will need to cache a separate stream communicator as well. After all, the regular comm and the stream comm are used very differently, and it is cleaner not to mix them up.
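A rough sketch of what such caching could look like, using an MPI attribute keyed on the user's comm. The keyval scheme and the handle-to-pointer casts are illustrative only (not PETSc's actual code), and a real version would install a delete callback to free the cached comm:

```c
#include <mpi.h>
#include <stdint.h>

static int stream_comm_keyval = MPI_KEYVAL_INVALID;

// return the stream comm cached on `comm`, creating and caching it on first use
MPI_Comm GetStreamComm(MPI_Comm comm, MPIX_Stream stream)
{
  void *attr;
  int   flag;

  if (stream_comm_keyval == MPI_KEYVAL_INVALID) {
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, &stream_comm_keyval, NULL);
  }
  MPI_Comm_get_attr(comm, stream_comm_keyval, &attr, &flag);
  if (flag) return (MPI_Comm)(intptr_t)attr; // cache hit

  MPI_Comm scomm;
  MPIX_Stream_comm_create(comm, stream, &scomm);
  MPI_Comm_set_attr(comm, stream_comm_keyval, (void *)(intptr_t)scomm);
  return scomm;
}
```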
-
Hmmm, this implies you cannot use a stream comm in place of a regular comm? I.e.

```c
MPIX_Stream_comm_create(..., &scomm);
MPI_Allreduce(..., scomm);
```

Indeed. The code snippet linked in the description stores all created objects in a

```cpp
std::unordered_map<cudaStreamId_t, std::unordered_map<MPI_Comm, MPI_Comm>> stream_comm_cache;

MPI_Comm scomm = stream_comm_cache[cuda_stream.get_id()][src_comm];
```

What should the value of `MPIR_CVAR_CH4_RESERVE_VCIS` be? Suppose, however, that we have decided on a value for it. How should external code (i.e. PETSc) go about setting this variable? It leaks the abstraction to ask users to set it themselves, and

```c
setenv("MPIR_CVAR_CH4_RESERVE_VCIS", "whatever", 1);
MPI_Init(&argc, &argv);
```

is also unsavory. It means that PETSc must be the one to initialize MPI. We currently allow users to initialize it themselves prior to `PetscInitialize()`.
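To make that fragility concrete, the best a library can seemingly do is guard on whether MPI is already up; a sketch, where the value "32" is purely a placeholder:

```c
#include <stdlib.h>

int initialized;
MPI_Initialized(&initialized);
if (!initialized) {
  // only effective if we get here before anyone has called MPI_Init()
  setenv("MPIR_CVAR_CH4_RESERVE_VCIS", "32", 1);
  MPI_Init(&argc, &argv);
} else {
  // too late: the reserved-VCI pool was fixed when MPI_Init() ran
}
```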
-
That is correct. Actually, there are regular stream comms with regular streams (not CUDA streams). The regular stream comms can be used in place of a regular comm. The stream comms that are attached to CUDA streams have different semantics, i.e. all operations need to be asynchronously enqueued to the CUDA stream, so they are not interchangeable with regular comms.
We are well aware of the inconvenience, and I think it is possible to make it a bit more implicit or dynamic. First we are focusing on usability and functionality; then we will address the convenience part, especially the pieces that require more effort.
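If I follow, that means a stream comm built from a plain `MPIX_Stream` (no device stream attached) accepts ordinary blocking calls; a minimal sketch of my understanding:

```c
double x = 1.0, sum;
MPIX_Stream rstream;
MPI_Comm    rcomm;

MPIX_Stream_create(MPI_INFO_NULL, &rstream); // regular stream, no CUDA attachment
MPIX_Stream_comm_create(MPI_COMM_WORLD, rstream, &rcomm);
MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, rcomm); // fine: behaves like a regular comm

MPI_Comm_free(&rcomm);
MPIX_Stream_free(&rstream);
```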
-
Ah, that is much easier to handle then!
Of course. On this note, we would be more than happy to collaborate closely to help iron out any kinks. I am sure a good stress test of
-
I am getting

What does this error indicate? I am doing (roughly)

```c
MPIX_Irecv_enqueue(...);
...
MPIX_Waitall_enqueue(...); // error fires here
```

(@jczhang07 this error is firing from

Note that I do not get this error if I
-
In
-
The default is using

This approach involves a wait kernel -- a CUDA kernel that busy-waits on an external atomic variable. We are hitting some deadlock issues. This is likely because the CUDA runtime is unaware of the dependency and may have extra locks or synchronizations that cause the deadlock. Effort is needed to pin down the issue and work out a mechanism to ensure robustness, as well as to verify the performance.
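For readers unfamiliar with the technique, here is a toy illustration of the wait-kernel idea (not MPICH's actual implementation): a kernel launched onto the user's CUDA stream spins on a mapped host flag that the MPI progress engine sets on completion. Because the CUDA runtime cannot see that dependency, it is easy to imagine how deadlocks arise.

```cuda
#include <cuda_runtime.h>

// spins until the host sets *done; the CUDA runtime has no idea this
// kernel's completion depends on external (MPI progress) activity
__global__ void wait_kernel(volatile int *done)
{
  while (*done == 0) { /* busy wait */ }
}

// enqueue the wait onto stream s; `done` must be device-accessible,
// e.g. allocated with cudaHostAlloc(..., cudaHostAllocMapped)
void enqueue_wait(cudaStream_t s, volatile int *done)
{
  wait_kernel<<<1, 1, 0, s>>>(done);
}
```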
-
OK, I've also tried this, but am running into errors. For reference I am calling
-
I got the exact same error for the official test suite (release

I am not sure whether it is a UCX-related issue or MPICH itself. The bright side is that I found this thread with the exact same symptoms; the dark side is that the thread seems dead -- after more than a year, the problem still exists with the latest official release (v4.2.2) and there is no further information. I wonder if the OP made any progress on this.
-
I am a developer for PETSc (down the hall), and am working on integrating an experimental GPU stream-aware MPI layer. We are using MPICH's experimental `MPIX_Stream` extension, but are running into some difficulties/have some questions. For reference, the relevant PETSc code for this new feature can be found in the function here.
I have summarized the biggest ones below, but I am sure there will be many more...
Just to refresh everyone's memory, you must do
———
1.1. What is the lifetime of the `MPIX_Stream` vis-a-vis the stream comm (`scomm`)? Can it outlive `scomm`? If not, in what order should we destroy `mpi_stream` and `scomm`?

1.2. What is the lifetime of the `MPIX_Stream` vis-a-vis the underlying device stream? I assume that the `MPIX_Stream` must be destroyed immediately before the CUDA stream is destroyed.

1.3. Do we need a brand new `scomm` for each `cudaStream_t`-`MPI_Comm` pairing? Reading through `MPIR_Stream_comm_create_impl()` this appears to be the case.

1.4. Following on from 1.3, can we reuse the same `MPIX_Stream` to create multiple separate `src_comm`s (but with the same underlying CUDA stream)?

1.5. Following on again from 1.3, is there any way to make a non-stream communicator into a stream comm in place? I.e. something like `MPIX_Set_stream(PETSC_COMM_WORLD, some_mpix_stream)`?

———
Which I dutifully rectify by doing
But this seems clunky, and leads me to believe I have not done some other necessary setup. Why the environment variable? To me this implies some one-time setup, likely in `MPI_Init()`. Surely there is a way to dynamically allocate more streams while the application is running?