Progress between RMA/P2P/Collectives

Since not everyone in the RMA WG participates in the Terms WG, where the discussion of progress rules will be handled, I wanted to get a feel for what people here think about how progress of RMA operations, and progress made by RMA synchronization calls, should be defined in the future. The current wording is not precise about which guarantees users can expect regarding progress of outstanding non-RMA communication operations when calling, for example, `MPI_Win_test`, or regarding progress of RMA operations when calling non-RMA procedures.

Here are two examples:

**Example 1: Progress of non-RMA operations in RMA calls:**
```C
if (rank == 0) {
  int flag = 0;
  MPI_Request sreq, rreq;
  MPI_Isend(&large_buffer, &sreq);
  MPI_Rput(..., &rreq);
  while (!flag) MPI_Win_test(&flag); // poll the RMA synchronization call
  MPI_Wait(&sreq);
} else if (rank == 1) {
  MPI_Recv(&large_buffer);
  MPI_Win_complete(...);
}
```

*Question:* should `MPI_Win_test` guarantee progress of the send?

**Example 2: Progress of RMA operations in non-RMA calls:**
```C
if (rank == 0) {
  int flag = 0;
  MPI_Request areq, breq;
  MPI_Raccumulate(..., &areq); // assuming operation not supported by HW, may fall back to AM
  while (!flag) MPI_Test(&areq, &flag); // assuming we can do something useful in between
  MPI_Send(large_message);
} else if (rank == 1) {
  int flag = 0;
  MPI_Request rreq;
  MPI_Irecv(large_message, ..., &rreq); // complete the barrier before completing the RMA epoch
  while (!flag) MPI_Test(&rreq, &flag); // assuming we can do something useful in between
}
```

Here, the only way I can see for the program to complete is for the test on the receive request to progress the accumulate operation.

From a user perspective, I expect both programs to be correct since I am continuously calling into MPI, giving the implementation a chance to progress any outstanding operations that the operations I am polling on might depend on. Of course, the MPI implementation has no knowledge of such dependencies.

In previous discussions (https://github.com/mpi-forum/mpi-issues/issues/499), the point was raised that RMA synchronization functions should not have to progress non-RMA operations, in order to avoid the added latency. That would render the first example incorrect.

The question now is: what expectations do people have in terms of progress inside RMA synchronization functions? I see the following options:

1) *Guaranteed progress* of any outstanding operation: a call to any RMA synchronization function has to ensure progress of outstanding non-RMA operations (if possible, of course). This may not be required on every call but the observable behavior should be that eventually all operations complete.
2) *No progress* guarantees in RMA: progress between RMA and non-RMA is unidirectional, i.e., RMA synchronization calls are not required to progress non-RMA operations, whereas non-RMA calls would still have to progress RMA operations. Isolation in both directions is likely not feasible due to passive-target RMA.
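
To make the difference between the two options concrete, here is a toy model of Example 1 in plain C, without any MPI. It is only a sketch under simplifying assumptions: the large send is assumed to need a fixed number of progress steps on rank 0, and rank 1 is assumed to close the epoch as soon as the send has completed. All names (`toy_win_test`, `rma_policy`, `poll_until_closed`) are hypothetical and do not correspond to any implementation.

```c
#include <stdbool.h>

/* The two options from the list above, as a policy switch (illustrative only). */
typedef enum { PROGRESS_GUARANTEED, PROGRESS_NONE } rma_policy;

typedef struct {
    int  send_steps_left; /* progress steps the large send still needs on rank 0 */
    bool epoch_closed;    /* set once rank 1 has received the message and closed the epoch */
} toy_state;

/* Toy stand-in for MPI_Win_test on rank 0. */
static bool toy_win_test(toy_state *s, rma_policy policy)
{
    /* Only under the guaranteed-progress option does the call advance the send. */
    if (policy == PROGRESS_GUARANTEED && s->send_steps_left > 0)
        s->send_steps_left--;

    /* Assumption: rank 1 closes the epoch right after its receive completes,
     * which in turn requires the send to have completed. */
    if (s->send_steps_left == 0)
        s->epoch_closed = true;

    return s->epoch_closed;
}

/* Polls like the loop in Example 1. Returns the number of polls needed,
 * or -1 if the flag was not set within max_polls (modelling a hang). */
static int poll_until_closed(rma_policy policy, int send_steps, int max_polls)
{
    toy_state s = { send_steps, false };
    for (int i = 1; i <= max_polls; i++) {
        if (toy_win_test(&s, policy))
            return i;
    }
    return -1;
}
```

In this model, `poll_until_closed(PROGRESS_GUARANTEED, 3, 1000)` terminates after three polls, while `poll_until_closed(PROGRESS_NONE, 3, 1000)` never sees the flag set, which mirrors why the first example would be incorrect under option 2.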

**Example 3: Guaranteed progress**
```C
if (rank == 0) {
  int signal = 0;
  MPI_Request sreq;
  MPI_Isend(large_message, 1, &sreq);
  while (!signal) { // poll for a signal to be set by rank 1
    MPI_Get(&signal, myrank, ...); 
    MPI_Win_flush_local(myrank); // the get completes immediately
  }
  MPI_Wait(&sreq);
} else if (rank == 1) {
  int signal = 1;
  MPI_Recv(large_message, 0);
  MPI_Put(&signal, 0);
  MPI_Win_flush(0);
}
```

Without progress guarantees from `MPI_Win_flush_local`, the application would be required to test on `sreq` to ensure completion of the send. If the send and the get were issued by different user libraries, the application would have to ensure that the *progress dependencies* are handled correctly (on top of the *data dependencies* of initiated operations). Similarly, what behavior is expected when we use `MPI_Rget`+`MPI_Wait` instead of `MPI_Win_flush_local`? Do we expect progress of the send even though the local get completed immediately?

I can see the argument that RMA communication is especially latency sensitive and that any additional progress may be costly. On the other hand, implementations may have some leeway to limit progress of non-RMA operations to every Nth call, reducing the impact on latency while still ensuring eventual completion of communication dependencies.
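
As a rough illustration of the "every Nth call" idea, the following plain C sketch (no MPI) shows a hook that an implementation might run inside each RMA synchronization call. It is an assumption-laden sketch, not how any existing MPI library works: `NON_RMA_PROGRESS_INTERVAL`, `progress_state`, `progress_non_rma`, and `rma_sync_progress_hook` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical knob: drive non-RMA progress on every N-th synchronization call. */
#define NON_RMA_PROGRESS_INTERVAL 4

/* Stand-in for implementation-internal state (illustrative only). */
typedef struct {
    unsigned sync_calls;       /* RMA synchronization calls seen so far */
    unsigned non_rma_progress; /* how often non-RMA progress was driven */
} progress_state;

/* Stand-in for the potentially expensive non-RMA progress engine. */
static void progress_non_rma(progress_state *st)
{
    st->non_rma_progress++;
}

/* Would be invoked from inside each RMA synchronization call.
 * Returns true if this call also drove non-RMA progress. */
static bool rma_sync_progress_hook(progress_state *st)
{
    assert(st != NULL);
    st->sync_calls++;

    /* Fast path: most calls skip the non-RMA progress engine entirely. */
    if (st->sync_calls % NON_RMA_PROGRESS_INTERVAL != 0)
        return false;

    /* Slow path: taken once per interval, so a polling loop still
     * eventually progresses outstanding non-RMA operations. */
    progress_non_rma(st);
    return true;
}
```

With an interval of 4, twelve synchronization calls would pay for the non-RMA progress engine only three times, yet a loop that keeps polling would still observe eventual completion of the operations it depends on.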

In any case, the MPI standard should clearly outline what constitutes a correct program. Ignoring potential breakage of existing software, we could define the expectations from the RMA point of view either way and I would appreciate any input from the RMA working group :)
