Explore the potential of RMA-based APIs using libfabric.
Please cite our paper (see below).
PMI (client) and hydra (server) provide process management abilities.
Download them from the mpich website.
To build rmem, only the PMI library is needed, although hydra (or any other PMI server) is needed to run rmem.
Here are some examples of useful commands with hydra:
# run 2 processes, 1 process-per-node, label the output
mpiexec -n 2 -ppn 1 -l
# run 2 processes, bind each of them to a core
mpiexec -bind-to core
There are several ways to build libfabric (e.g. brew install libfabric on macOS). Here is how to build it from source:
./autogen.sh
CC=${CC} CXX=${CXX} ./configure --prefix=${YOUR_PREFIX} --enable-psm3 --enable-sockets
# for CUDA support, add
--with-cuda=${CUDA_HOME} --with-gdrcopy
# for AMD support, add
--with-rocr=${ROCM_PATH}
make install -j 8
To build CXI (Slingshot-11 provider), you will need some workarounds:
- the main branch doesn't build on most supercomputers (lib-cxi is too old, see here); use this branch instead
- install json-c from source with
cmake . -DCMAKE_INSTALL_PREFIX=${HOME}/json-c
and add the option --with-json=${HOME}/json-c to the libfabric ./configure command
You can check that the build is working using fi_pingpong:
# with mpiexec
mpiexec -n 1 ./fi_pingpong : -n 1 ./fi_pingpong localhost
# or without mpiexec
fi_pingpong & fi_pingpong localhost
We use a Makefile to compile.
To handle the different systems, the file make_arch/default.mak contains the different variable definitions needed to find the dependencies etc.
Specifically we rely on the following variables:
- CC: the compiler to use
- PMI_DIR: the root directory of pmi
- OFI_DIR: the root directory of ofi
- OPTS (optional): flags passed to the compiler for more flexibility, e.g. -fsanitize=address, -flto, etc.
The Makefile offers various targets by default:
- rmem: builds rmem
- info: displays info about the build
- clean/reallyclean: cleans the build
- default: displays the info and builds rmem
- fast: compiles for fast execution (equivalent to OPTS="-O3 -DNDEBUG")
- debug: compiles with debug symbols (equivalent to OPTS="-O0 -g")
- verbose: compiles for debug with added verbosity (equivalent to OPTS=-DVERBOSE make debug)
- asan: compiles for debug with the address and undefined-behavior sanitizers (equivalent to OPTS="-fsanitize=address -fsanitize=undefined" make verbose)
Note: if you prefer to add another make_arch file, you can also invoke it using ARCH_FILE=make_arch/myfile make.
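As a minimal sketch of the note above, the helper below writes an arch file with the three required variables. It assumes an arch file is a plain Makefile fragment of variable assignments, which should be checked against make_arch/default.mak. The function name write_arch_file and the example paths are hypothetical and not part of rmem.

```shell
# Hypothetical helper (not part of rmem): write a minimal arch file.
# Assumption: an arch file is a Makefile fragment of plain variable assignments.
write_arch_file() {
    # $1: output path, $2: compiler, $3: PMI root directory, $4: libfabric root directory
    {
        printf 'CC = %s\n' "$2"
        printf 'PMI_DIR = %s\n' "$3"
        printf 'OFI_DIR = %s\n' "$4"
    } > "$1"
}

# Example (placeholder paths), followed by the invocation from the note above:
#   write_arch_file make_arch/myfile gcc "${HOME}/pmi" "${HOME}/libfabric"
#   ARCH_FILE=make_arch/myfile make
```

OPTS is left out of the file on purpose so that it can still be passed on the command line as shown in the target list.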
The ready-to-receive protocol lets the target tell the origin of the RMA call that it is ready to receive.
- am: will use active messaging (fi_send) and pre-posted buffers at the sender
- tag: will use tagged messaging (fi_tsend and fi_trecv). The main performance bottleneck is unexpected messages
- atomic: uses an atomic operation (fi_atomic)
- am: will use fi_send and pre-posted buffers at the sender
- tag: will use fi_tsend and fi_trecv. The main performance bottleneck is unexpected messages
- cq_data: uses fi_cq_data to close the epoch, to be used with -c order
- delivery: uses delivery complete (FI_DELIVERY_COMPLETE) on the payload operation
- fence: uses a fence to issue the down-to-close acknowledgment
- cq_data: uses FI_CQ_DATA to track remote completion
- counter: uses FI_REMOTE_COUNTER to track remote completion using remote counters
- order: uses network ordering, must be used with -d cq_data
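The two cross-references above (-d cq_data goes with -c order, and -c order goes with -d cq_data) can be checked before launching a run. The sketch below is a hypothetical helper, not part of rmem; it only encodes that pairing rule and assumes nothing else about how rmem parses its options.

```shell
# Hypothetical helper (not part of rmem): check a -d / -c combination.
# Returns 0 when the pair respects the pairing rule stated above, 1 otherwise.
check_rmem_pairing() {
    d_opt=$1   # value intended for -d
    c_opt=$2   # value intended for -c
    if [ "$d_opt" = "cq_data" ] && [ "$c_opt" != "order" ]; then
        echo "-d cq_data must be combined with -c order" >&2
        return 1
    fi
    if [ "$c_opt" = "order" ] && [ "$d_opt" != "cq_data" ]; then
        echo "-c order must be combined with -d cq_data" >&2
        return 1
    fi
    return 0
}
```

For example, check_rmem_pairing cq_data order succeeds, while check_rmem_pairing am order fails with a message on stderr.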
Different networks have different capabilities and limitations. Here is a list of the restrictions we have encountered:
- psm3: does not support RMA natively, it is emulated in software over tagged messaging, see here
- verbs;ofi_rxm: poor native support of FI_ATOMIC
- cxi: doesn't support FI_CQ_DATA for the moment
- sockets: supports everything except FI_REMOTE_COUNTER
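For scripting, the restrictions listed above can be kept in a small lookup. This is only a sketch: the function name provider_restriction is hypothetical, and the entries simply restate the list above, so they may go out of date as providers evolve.

```shell
# Hypothetical helper (not part of rmem): print the restriction noted above
# for a given provider name. Returns 1 when the provider is not in the list.
provider_restriction() {
    case "$1" in
        psm3)            echo "no native RMA (emulated in software over tagged messaging)" ;;
        'verbs;ofi_rxm') echo "poor native support of FI_ATOMIC" ;;
        cxi)             echo "no FI_CQ_DATA support for the moment" ;;
        sockets)         echo "no FI_REMOTE_COUNTER support" ;;
        *)               echo "no restriction recorded"; return 1 ;;
    esac
}
```

Combined with the option lists, this suggests for instance avoiding the cq_data choices on cxi and the counter choice on sockets.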
To be announced
/*
* Copyright (c) 2024, UChicago Argonne, LLC
* See COPYRIGHT in top-level directory
*/