Hang in one-sided communications across multiple nodes #7118

Open
victor-anisimov opened this issue Aug 27, 2024 · 20 comments

victor-anisimov commented Aug 27, 2024

reproducer.tgz

One-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs in order to catch the hang. A successful job prints "All Done" in its output file; if that string is absent after the time runs out, the job hung. A job requires 2-3 minutes of walltime to complete, so a job that does not finish within 5 minutes has most likely hung and will not make further progress.
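
For readers without the tarball, here is a rough C sketch of the communication pattern described above. It is only an illustration: the actual reproducer is Fortran with OpenMP target offload and exchanges GPU device pointers, and the loop counts and neighbor choice below are made up.

/* Illustrative sketch only -- not the attached reproducer.  Split
 * MPI_COMM_WORLD into six subcommunicators, then repeat fenced MPI_Get
 * epochs and print "All Done" at the end, mirroring the reported setup. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NELEMS  1024
#define NEPOCHS 10000   /* made-up count; the real test runs many more */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Six subcommunicators, as in the 96-node / six 16-node configuration. */
    MPI_Comm subcomm;
    MPI_Comm_split(MPI_COMM_WORLD, wrank % 6, wrank, &subcomm);

    int rank, size;
    MPI_Comm_rank(subcomm, &rank);
    MPI_Comm_size(subcomm, &size);

    double *winbuf;
    MPI_Win win;
    MPI_Win_allocate(NELEMS * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, subcomm, &winbuf, &win);
    double *getbuf = malloc(NELEMS * sizeof(double));

    for (int epoch = 0; epoch < NEPOCHS; epoch++) {
        MPI_Win_fence(0, win);
        /* Each rank reads from its right neighbor in the subcommunicator. */
        MPI_Get(getbuf, NELEMS, MPI_DOUBLE, (rank + 1) % size,
                0, NELEMS, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        if (wrank == 0 && epoch % 100 == 0)
            printf("Rank 0 conducted %d one-sided messages\n", epoch);
    }

    MPI_Win_free(&win);
    free(getbuf);
    MPI_Comm_free(&subcomm);

    if (wrank == 0)
        printf("All Done\n");
    MPI_Finalize();
    return 0;
}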


hzhou commented Aug 28, 2024

@victor-anisimov I built a debug version of mpich on Aurora at /home/hzhou/pull_requests/mpich-debug/_inst. Could you try linking with that and running the reproducer while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1? The goal is to generate a backtrace as shown in #7084.

I am not sure of the easiest way to switch libmpi.so. Try

export LD_LIBRARY_PATH=/home/hzhou/pull_requests/mpich-debug/_inst/lib:$LD_LIBRARY_PATH

and verify with ldd -- [EDIT: that likely won't work due to the .so version mismatch]. If that doesn't work, try

LD_PRELOAD=/home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0


victor-anisimov commented Aug 28, 2024

Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing
module unload mpich
export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin
mpif90 -fc=ifx -O0 -g -i4 -r8 -what -fiopenmp -fopenmp-targets=spir64 test.F90
ldd ./a.out
libmpifort.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpifort.so.0 (0x00007f017ddf7000)
libmpi.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0 (0x00007f017b486000)
./bulk.sh

The test job crashed at initialization with the error message:
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
libpthread-2.31.s 00001460507998C0 Unknown Unknown Unknown
a.out 000000000040C7E2 Unknown Unknown Unknown
a.out 000000000040C47D Unknown Unknown Unknown
libc-2.31.so 00001460505C324D __libc_start_main Unknown Unknown
a.out 000000000040C3AA Unknown Unknown Unknown

I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test?

reproducer6.tgz


hzhou commented Aug 29, 2024

module unload mpich

Make sure to set PALS_PMI=pmix if you unload the mpich module.

while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1

Make it MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10. A 1-second timeout was too short; 10 seconds seems to be sufficient for normal progress to complete without triggering a false hang report.

Could you take the reproducer,

Running 100 loops now (or until my 2-hour walltime kicks me out); will update.


hzhou commented Aug 29, 2024

I think I got some timeouts; I need to confirm with a larger MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT -

...
Rank 0 conducted      1148200 one-sided messages
Rank 0 conducted      1148300 one-sided messages
3 pending requests in pool 0
    ac000004: MPIDI_OFI_do_irecv: source=4, tag=14, data_sz=0
/home/hzhou/MPI/lib/libmpi.so.0(+0xc1bad4) [0x14ab09776ad4]
/home/hzhou/MPI/lib/libmpi.so.0(+0x954a0b) [0x14ab094afa0b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x9557cd) [0x14ab094b07cd]
/home/hzhou/MPI/lib/libmpi.so.0(+0x955167) [0x14ab094b0167]
/home/hzhou/MPI/lib/libmpi.so.0(+0x802a58) [0x14ab0935da58]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6bf95b) [0x14ab0921a95b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6d1866) [0x14ab0922c866]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7977d8) [0x14ab092f27d8]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797aa3) [0x14ab092f2aa3]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7caee0) [0x14ab09325ee0]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7ca90e) [0x14ab0932590e]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7cac96) [0x14ab09325c96]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797f97) [0x14ab092f2f97]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797c72) [0x14ab092f2c72]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4db4dc) [0x14ab090364dc]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da580) [0x14ab09035580]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da3f9) [0x14ab090353f9]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da1cc) [0x14ab090351cc]
/home/hzhou/MPI/lib/libmpi.so.0(MPI_Win_fence+0x19) [0x14ab09034e69]
/home/hzhou/MPI/lib/libmpifort.so.0(mpi_win_fence_f08_+0x2d) [0x14ab0b5898ad]
./a.out() [0x412013]
./a.out() [0x40c4fd]
/lib64/libc.so.6(__libc_start_main+0xef) [0x14ab0782d24d]
./a.out() [0x40c42a]
...
Abort(546498063) on node 864: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
Abort(546498063) on node 876: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
...

It is stuck in an internal Barrier.


hzhou commented Aug 29, 2024

 grep 'All Done' job*.out
job10.out:All Done
job11.out:All Done
job12.out:All Done
job13.out:All Done
job14.out:All Done
job16.out:All Done
job17.out:All Done
job18.out:All Done
job19.out:All Done
job1.out:All Done
job21.out:All Done
job22.out:All Done
job25.out:All Done
job26.out:All Done
job27.out:All Done
job28.out:All Done
job29.out:All Done
job2.out:All Done
job30.out:All Done
job31.out:All Done
job32.out:All Done
job33.out:All Done
job34.out:All Done
job35.out:All Done
job36.out:All Done
job37.out:All Done
job38.out:All Done
job39.out:All Done
job3.out:All Done
job40.out:All Done
job41.out:All Done
job42.out:All Done
job4.out:All Done
job6.out:All Done
job7.out:All Done
job8.out:All Done
job9.out:All Done
job.out:All Done

Timed out in jobs 5, 23, and 24.

victor-anisimov commented:

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?


raffenet commented Aug 29, 2024

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

Our previous investigation suggested that one of the target processes somehow gets overwhelmed by MPI_Get requests and basically stops being able to send responses to the origin. We suspected a libfabric/cxi issue, but it was not clear why it would hang rather than report an error and crash. It's possible that #7117 may improve the error reporting in this case.


hzhou commented Aug 29, 2024

Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to a lower-layer failure or congestion) may cause the issue.


victor-anisimov commented Aug 30, 2024

I tried the latest commit b3480dd, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged: I still get 8 hangs per 50 runs of the test. All 8 hanging jobs were killed by the scheduler for exceeding the walltime limit, and none of them printed any error or warning messages from MPICH. My conclusion is that either I do not know how to run MPICH in a way that gets those forward-progress errors printed out, or the fixes do not work for the hang in one-sided communications. Could @hzhou or @raffenet try the latest MPICH with the one-sided bug reproducer to see if the forward-progress fixes work for the present case?


hzhou commented Aug 30, 2024

The one built by Colleen doesn't have --enable-g=progress enabled, I believe. We suspect your case is due to the underlying network being unable to handle certain congestion situations while not raising an error; at this point it is just a guess. I need to think about how to further narrow down the reproducer. Currently, a roughly 1-in-20 failure rate, and thus hours of wall time per reproduction, is a bit difficult to work with.

colleeneb commented:

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?


hzhou commented Aug 30, 2024

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@colleeneb Only do that for the debug build (vs. the performance build). For the debug build, you may also want to use --enable-g=dbg,log,progress


iziemba commented Nov 6, 2024

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.


raffenet commented Nov 6, 2024

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

@iziemba Invalid in what way? I see that user-provided keys are limited to 4 bytes in the fi_cxi manpage. It does look like MPICH could generate a key larger than that the way the code is set up today. But looking at the error checks, I would expect MPICH to get -FI_EKEYREJECTED from fi_mr_reg and return it to the user rather than proceed and hang.
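
For concreteness, the error handling described above would look roughly like the fragment below. This is illustrative only and not MPICH's actual registration path; domain, buf, len, requested_key, and mr stand in for whatever MPICH uses internally.

/* Illustrative fragment, not MPICH code: if the provider enforced the
 * 4-byte user-key limit, an over-wide requested key would surface as
 * -FI_EKEYREJECTED here and could be reported instead of hanging later. */
int ret = fi_mr_reg(domain, buf, len, FI_REMOTE_READ,
                    0 /* offset */, requested_key, 0 /* flags */, &mr, NULL);
if (ret == -FI_EKEYREJECTED)
    return ret;   /* invalid requested key: report it to the user */
else if (ret < 0)
    return ret;   /* some other registration failure */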


raffenet commented Nov 6, 2024

I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:

    int ret;
    uint64_t key = UINT64_MAX;   /* deliberately out-of-range requested key */
    void *buf = MPL_malloc(16, MPL_MEM_OTHER);
    struct fid_mr *mr;
    ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                    buf, 16,
                    FI_REMOTE_READ, 0,
                    key, 0, &mr, NULL);
    printf("ret = %d, %s\n", ret, fi_strerror(ret));

and here is the output 😕

ret = 0, Success
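
One possible follow-up to the experiment above (assuming the mr and key variables from that snippet are still in scope) is to ask the provider what key it actually assigned and compare it with the requested one:

/* Continuation of the snippet above; mr and key come from there.
 * fi_mr_key() returns the key the provider actually associated with the
 * registration, which can be compared against the requested UINT64_MAX.
 * Requires <inttypes.h> for PRIx64. */
uint64_t assigned = fi_mr_key(mr);
printf("requested key = 0x%016" PRIx64 ", assigned key = 0x%016" PRIx64 "\n",
       key, assigned);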


raffenet commented Nov 6, 2024

More info: after building my own libfabric and adding debug printfs, I see that the provider thinks the requested key is a provider key for some reason and therefore considers it valid. If I enable MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, the provider actually skips the validation step altogether, because the domain is set up to use provider keys exclusively, I believe.


raffenet commented Nov 6, 2024

So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key and take the cxip_key.is_prov branch and somehow pass the validation there...

bool cxip_generic_is_valid_mr_key(uint64_t key)
{
        struct cxip_mr_key cxip_key = {
                .raw = key,
        };

        if (cxip_key.is_prov)
                return cxip_is_valid_prov_mr_key(key);

        return cxip_is_valid_mr_key(key);
}


raffenet commented Nov 6, 2024

After looking closer, I no longer think MPICH will generate a key larger than 4 bytes in this scenario. It must be something else.


iziemba commented Nov 6, 2024

For user keys, I can update the CXI provider to check the RKEY size and ensure it is <= 4 bytes.
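
A hypothetical sketch of that user-key check (not the actual CXI provider patch) could be as simple as:

/* Hypothetical sketch, not the actual cxip change: in the user-key
 * (non-provider-key) path, reject any requested key wider than 4 bytes. */
#include <stdbool.h>
#include <stdint.h>

static bool user_mr_key_fits_4_bytes(uint64_t key)
{
    return key <= UINT32_MAX;
}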

The FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that the libfabric CXI provider was returning the expected generated RKEYs. Yet I then see MPICH use an invalid key. Example:

libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4f40f70 (key: 0xA000DD743707002F)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c37c10 (key: 0xA000EF2705D8004D)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894a10 (key: 0xA000A22111480016)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x58947b0 (key: 0xA000A2211149001B)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367ec40 (key: 0xA000EF2705D9003C)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367e870 (key: 0xA000EF2705DA0028)
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4e62670 (key: 0xA000DD7437080029)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894c70 (key: 0xA000A221114A0024)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894ed0 (key: 0xA000A221114B0037)
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x50c8b30 (key: 0xA000DD7437090027)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c8df10 (key: 0xA000EF2705DB0040)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x436bdf0 (key: 0xA000EF2705DC0036)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5895390 (key: 0xA000A221114C0009)


raffenet commented Nov 6, 2024

libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441

Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something.
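
A minimal way to do that tracing, as a hypothetical instrumentation sketch (these helpers do not exist in MPICH or libfabric), is to log the key at both ends and diff the logs:

/* Hypothetical instrumentation, not existing MPICH/libfabric code: log the
 * MR key where it is created (target side) and where it is consumed
 * (origin side) so the two logs can be compared for corruption. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Call right after fi_mr_reg()/MR enable succeeds on the target. */
static void trace_key_created(int rank, uint64_t key)
{
    fprintf(stderr, "[rank %d] MR key created: 0x%016" PRIx64 "\n", rank, key);
}

/* Call right before the key is passed to fi_read()/fi_write() at the origin. */
static void trace_key_used(int rank, int target, uint64_t key)
{
    fprintf(stderr, "[rank %d] using MR key 0x%016" PRIx64 " for target %d\n",
            rank, key, target);
}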
