Progress between RMA/P2P/Collectives #17

@devreal

Since not everyone in the RMA WG participates in the Terms WG, where the discussion of progress rules will be handled, I wanted to get a feel for what people here think about how progress of RMA operations, and progress made by RMA synchronization calls, should be defined in the future. The current wording is not precise about which guarantees users can expect regarding progress of outstanding non-RMA communication operations when calling MPI_Win_test, for example, or progress of RMA operations when calling non-RMA procedures.

Here are two examples:

Example 1: Progress of non-RMA operations in RMA calls:

if (rank == 0) {
  MPI_Request sreq, rreq;
  int flag = 0;
  MPI_Isend(&large_buffer, &sreq);
  MPI_Rput(..., &rreq);
  while (!flag) MPI_Win_test(&flag); // only the RMA synchronization call is polled
  MPI_Wait(&sreq);
} else if (rank == 1) {
  MPI_Recv(&large_buffer);
  MPI_Win_complete(...);
}

Question: should MPI_Win_test guarantee progress of the send?

Example 2: Progress of RMA operations in non-RMA calls:

if (rank == 0) {
  MPI_Request areq, breq;
  int flag = 0;
  MPI_Raccumulate(..., &areq); // assuming operation not supported by HW, may fall back to AM
  while (!flag) MPI_Test(&areq, &flag); // assuming we can do something useful in between
  MPI_Send(large_message);
} else if (rank == 1) {
  MPI_Request rreq;
  int flag = 0;
  MPI_Irecv(large_message, ..., &rreq); // complete the barrier before completing the RMA epoch
  while (!flag) MPI_Test(&rreq, &flag); // assuming we can do something useful in between
}

Here, the only option I can see for completion is for the test on the receive request to progress the accumulate operations.

From a user perspective, I expect both programs to be correct since I am continuously calling into MPI, giving the implementation a chance to progress any outstanding operations that the operations I am polling on might depend on. Of course, the MPI implementation has no knowledge of such dependencies.

In previous discussions (mpi-forum/mpi-issues#499) the point was raised that the RMA synchronization functions should not have to progress non-RMA operations, in order to avoid the added latency. That would render the first example incorrect.

The question now is: what expectations do people have in terms of progress inside RMA synchronization functions? I see two options:

  1. Guaranteed progress of any outstanding operation: a call to any RMA synchronization function has to ensure progress of outstanding non-RMA operations (if possible, of course). This may not be required on every call but the observable behavior should be that eventually all operations complete.
  2. No progress guarantees in RMA: progress between RMA and non-RMA operations is unidirectional. RMA synchronization calls are not required to progress non-RMA operations, while non-RMA calls still progress RMA operations. Isolation in both directions is likely not feasible due to passive target RMA.

Example 3: Guaranteed progress

if (rank == 0) {
  int signal = 0;
  MPI_Request sreq;
  MPI_Isend(large_message, 1, &sreq);
  while (!signal) { // poll for a signal to be set by rank 1
    MPI_Get(&signal, rank, ...); // read the signal from our own window
    MPI_Win_flush_local(rank); // the get completes immediately
  }
  MPI_Wait(&sreq);
} else if (rank == 1) {
  int signal = 1;
  MPI_Recv(large_message, 0);
  MPI_Put(&signal, 0);
  MPI_Win_flush(0);
}

Without progress guarantees from MPI_Win_flush_local, the application would be required to test on sreq to ensure completion of the send. If the send and the get were issued by different user libraries the application would have to ensure that the progress dependencies are correctly handled (on top of the data dependencies of initiated operations). Similarly, what behavior is expected when instead of using MPI_Win_flush_local we use MPI_Rget+MPI_Wait? Do we expect progress of the send even though the local get completed immediately?

I can see the argument that RMA communication is especially latency sensitive and that any additional progress may be costly. On the other hand, implementations may have some leeway to limit progress of non-RMA operations to every Nth call, reducing the impact on latency while still ensuring eventual completion of communication dependencies.
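
To make the "every Nth call" idea concrete, here is a minimal sketch in plain C. It is a toy model and not taken from any MPI implementation: the names toy_engine, toy_rma_sync and toy_poll_until_send_done are hypothetical, and "progress" is reduced to decrementing a counter. An interval of 1 corresponds to progressing non-RMA operations on every synchronization call, a larger interval to throttled progress, and an interval of 0 to no progress of non-RMA operations at all.

```c
#include <assert.h>

/* Hypothetical toy model, NOT an MPI implementation: a single pending
 * non-RMA operation (think of the Isend in Example 3) that needs
 * `steps_left` progress steps before it completes. */
typedef struct {
    int steps_left;      /* remaining progress steps of the pending send */
    int interval;        /* progress non-RMA work on every interval-th call; 0 = never */
    unsigned long calls; /* number of RMA synchronization calls made so far */
} toy_engine;

/* Stands in for an RMA synchronization call such as MPI_Win_flush_local.
 * RMA work itself is not modeled; only the throttled side effect on the
 * non-RMA operation is. */
static void toy_rma_sync(toy_engine *e) {
    e->calls++;
    if (e->interval > 0 &&
        e->calls % (unsigned long)e->interval == 0 &&
        e->steps_left > 0) {
        e->steps_left--; /* one unit of non-RMA progress */
    }
}

/* Polls the synchronization call at most max_calls times. Returns the
 * number of calls after which the send was complete, or -1 if it did
 * not complete within max_calls. */
static long toy_poll_until_send_done(int steps, int interval, long max_calls) {
    assert(steps >= 0 && interval >= 0 && max_calls >= 0);
    toy_engine e = { steps, interval, 0 };
    for (long i = 0; i < max_calls; i++) {
        if (e.steps_left == 0) return (long)e.calls;
        toy_rma_sync(&e);
    }
    return e.steps_left == 0 ? (long)e.calls : -1;
}
```

In this model a send that needs 3 progress steps completes after 3 polls with an interval of 1 and after 12 polls with an interval of 4, while with an interval of 0 the polling loop never observes completion. The throttled variant trades completion delay for fewer progress invocations per synchronization call, but still gives eventual completion.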

In any case, the MPI standard should clearly outline what constitutes a correct program. Ignoring potential breakage of existing software, we could define the expectations from the RMA point of view either way and I would appreciate any input from the RMA working group :)
