Description
This is not our bug and we will not fix it, but the details are documented here for posterity.
There is a bug in Intel MPI 2021.10 and Cray MPI 8.1.29 when using request-based RMA (#53). It could be an MPICH bug in the argument-checking macros, but I tested MPICH 4.2 extensively today and it does not appear there.
In MPI_Rget_accumulate(NULL, 0, MPI_BYTE, ..., MPI_NO_OP, ...), the implementation incorrectly reports that MPI_BYTE has not been committed.
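For context, here is a minimal sketch of the offending call pattern, independent of ARMCI. This is illustrative code, not taken from armci-mpi, and it assumes the bug triggers on any MPI_Rget_accumulate whose origin tuple is (NULL, 0, MPI_BYTE) with MPI_NO_OP; a conforming implementation must accept such a call.

/* Minimal sketch of the offending call pattern (illustrative only, not part
 * of armci-mpi).  With MPI_NO_OP the origin buffer is ignored, so the
 * (NULL, 0, MPI_BYTE) tuple is legal and must be accepted. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 42.0;
    MPI_Barrier(MPI_COMM_WORLD);   /* make sure the target value is set */

    MPI_Win_lock_all(0, win);

    double result = 0.0;
    MPI_Request req = MPI_REQUEST_NULL;
    /* fetch-only accumulate: the affected implementations presumably abort
     * here, claiming MPI_BYTE has not been committed */
    MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                        &result, 1, MPI_DOUBLE,
                        0 /* target rank */, 0 /* target disp */, 1, MPI_DOUBLE,
                        MPI_NO_OP, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Win_unlock_all(win);
    printf("fetched %f\n", result);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}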
Reproduce by running the following in, e.g., /tmp:
. /opt/intel/oneapi/setvars.sh --force
git clone --depth 1 https://github.com/jeffhammond/armci-mpi -b request-based-rma
cd armci-mpi
./autogen.sh
mkdir build
cd build
../configure CC=/opt/intel/oneapi/mpi/2021.10.0/bin/mpicc --enable-g
make -j checkprogs
export ARMCI_VERBOSE=1
mpirun -n 4 ./tests/contrib/armci-test # this fails
export ARMCI_RMA_ATOMICITY=0 # this disables MPI_Rget_accumulate(MPI_NO_OP)
mpirun -n 4 ./tests/contrib/armci-test # this works
It fails here:
Testing non-blocking gets and puts
local[0:2] -> remote[0:2] -> local[1:3]
local[1:3,0:0] -> remote[1:3,0:0] -> local[1:3,1:1]
local[2:3,0:1,2:3] -> remote[2:3,0:1,2:3] -> local[1:2,0:1,2:3]
local[2:2,1:1,3:5,1:5] -> remote[4:4,0:0,1:3,1:5] -> local[3:3,1:1,1:3,2:6]
local[1:4,1:1,0:0,2:6,0:2] -> remote[1:4,2:2,1:1,2:6,1:3] -> local[0:3,1:1,5:5,2:6,2:4]
local[1:4,0:2,1:7,5:6,0:6,1:2] -> remote[0:3,0:2,1:7,7:8,0:6,0:1] -> local[0:3,0:2,0:6,3:4,0:6,0:1]
local[3:4,0:1,0:0,5:7,5:6,0:1,0:1] -> remote[1:2,0:1,0:0,5:7,2:3,0:1,0:1] -> local[0:1,0:1,4:4,2:4,3:4,0:1,0:1]
Abort(336723971) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Rget_accumulate: Invalid datatype, error stack:
PMPI_Rget_accumulate(218): MPI_Rget_accumulate(origin_addr=(nil), origin_count=0, MPI_BYTE, result_addr=0x60d4e105cfd0, result_count=1, dtype=USER<contig>, target_rank=3, target_disp=8, target_count=1, dtype=USER<contig>, MPI_NO_OP, win=0xa0000001, 0x7ffc044c9558) failed
PMPI_Rget_accumulate(159): Datatype has not been committed
MPI_BYTE is a predefined datatype and is pre-committed; it never needs an explicit MPI_Type_commit.
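For reference, a small illustration (not from armci-mpi) of the commit rule: only derived datatypes require MPI_Type_commit before use, while predefined types like MPI_BYTE may be used directly.

/* Illustration: MPI_Type_commit is required for derived datatypes only;
 * predefined datatypes such as MPI_BYTE are pre-committed. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char sbuf[16] = {0}, rbuf[16];

    /* A derived datatype must be committed before it is used... */
    MPI_Datatype contig;
    MPI_Type_contiguous(16, MPI_BYTE, &contig);
    MPI_Type_commit(&contig);
    MPI_Sendrecv(sbuf, 1, contig, 0, 0,
                 rbuf, 1, contig, 0, 0,
                 MPI_COMM_SELF, MPI_STATUS_IGNORE);
    MPI_Type_free(&contig);

    /* ...but MPI_BYTE can be used directly, with no commit at all. */
    MPI_Sendrecv(sbuf, 16, MPI_BYTE, 0, 0,
                 rbuf, 16, MPI_BYTE, 0, 0,
                 MPI_COMM_SELF, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}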
Here is a patch that works around the Intel MPI bug, which confirms where the problem lies:
diff --git a/src/gmr.c b/src/gmr.c
index 129b97c..acf8539 100644
--- a/src/gmr.c
+++ b/src/gmr.c
@@ -603,7 +603,9 @@ int gmr_get_typed(gmr_t *mreg, void *src, int src_count, MPI_Datatype src_type,
MPI_Request req = MPI_REQUEST_NULL;
if (ARMCII_GLOBAL_STATE.rma_atomicity) {
- MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
+ // using the source type instead of MPI_BYTE works around an Intel MPI 2021.10 bug...
+ MPI_Rget_accumulate(NULL, 0, src_type /* MPI_BYTE */,
dst, dst_count, dst_type, grp_proc,
(MPI_Aint) disp, src_count, src_type,
MPI_NO_OP, mreg->window, &req);

The setting ARMCI_RMA_ATOMICITY=0 disables this code path in favor of the following MPI_Get, which works just fine with the same arguments except for the (NULL, 0, MPI_BYTE) tuple, which of course is unused.
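For clarity, a hedged sketch of the two code paths. This is not the actual armci-mpi source: the argument names mirror the diff above, and the flush-based completion in the non-atomic branch is an assumption.

/* Hedged sketch (not the actual armci-mpi source) contrasting the two paths. */
#include <mpi.h>

static void get_sketch(int rma_atomicity,
                       void *dst, int dst_count, MPI_Datatype dst_type,
                       int grp_proc, MPI_Aint disp,
                       int src_count, MPI_Datatype src_type, MPI_Win win)
{
    if (rma_atomicity) {
        /* atomic path: fetch via accumulate with MPI_NO_OP; the
         * (NULL, 0, MPI_BYTE) origin tuple is what trips the bug */
        MPI_Request req = MPI_REQUEST_NULL;
        MPI_Rget_accumulate(NULL, 0, MPI_BYTE,
                            dst, dst_count, dst_type,
                            grp_proc, disp, src_count, src_type,
                            MPI_NO_OP, win, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else {
        /* non-atomic path (ARMCI_RMA_ATOMICITY=0): a plain get with the
         * same target arguments, so MPI_BYTE never appears */
        MPI_Get(dst, dst_count, dst_type,
                grp_proc, disp, src_count, src_type, win);
        MPI_Win_flush(grp_proc, win);   /* assumes a passive-target epoch */
    }
}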