Describe the bug
During the MPI_Init call on ALPS at CSCS (Switzerland), on AArch64 NVIDIA Grace/H200 nodes, the libfabric "patch" routines
corrupt the procedure linkage table (PLT).
To Reproduce
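Call MPI_Init. The symptom can be non-obvious and deferred: the corruption happens inside MPI_Init, but the failure only shows up when a later call goes through the damaged PLT entry (in my case, MPI_Comm_dup).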
Expected behavior
Program runs to completion.
Output
MPI_Comm_dup fails after MPI_Init. Single-stepping at the assembly level through the shared library's PLT
shows the "wrong" instruction, br x15, at the PLT entry point for MPI_Comm_dup.
Setting a watchpoint on the address of that PLT entry and re-running the call to MPI_Init traps the write that
corrupts the PLT. The call stack for that write is embedded below.
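
For anyone trying to confirm the same symptom, here is a minimal sketch of how the first instruction word at a PLT entry could be checked. It assumes the conventional non-BTI AArch64 PLT stub, which starts with `adrp x16, ...` and ends with `br x17`. The helper names (`decode`, `looks_overwritten`) are hypothetical and not part of libfabric or any MPI library.

```python
# Sketch: classify a 32-bit AArch64 instruction word read from a PLT entry.
# Assumption: an intact PLT stub begins with "adrp x16, ..." (non-BTI layout).

BR_MASK, BR_BASE = 0xFFFFFC1F, 0xD61F0000      # BR Xn: register number in bits 9:5
ADRP_MASK, ADRP_BASE = 0x9F000000, 0x90000000  # ADRP Xd: register number in bits 4:0


def decode(insn: int) -> str:
    """Return a short mnemonic for BR and ADRP; anything else is 'other'."""
    if (insn & BR_MASK) == BR_BASE:
        return f"br x{(insn >> 5) & 0x1F}"
    if (insn & ADRP_MASK) == ADRP_BASE:
        return f"adrp x{insn & 0x1F}"
    return "other"


def looks_overwritten(first_insn: int) -> bool:
    """True if the first word of the PLT entry is not the expected adrp x16."""
    return decode(first_insn) != "adrp x16"


if __name__ == "__main__":
    for word in (0xD61F01E0, 0x90000010):
        print(hex(word), decode(word), "overwritten" if looks_overwritten(word) else "intact")
```

The word itself can be read in gdb with `x/wx` at the address of the PLT entry, once before and once after MPI_Init. The encoding `0xD61F01E0` corresponds to `br x15`, the instruction observed at the MPI_Comm_dup entry.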
Environment:
Linux on an AArch64 NVIDIA Grace/H200 superchip node with HPE Slingshot.
Additional context