
libfabric "patch" routines corrupt the procedure linkage table (PLT) on AARCH64 under Cray MPICH MPI_Init  #11451

@paboyle


Describe the bug

During the MPI_Init call on ALPS at CSCS, Switzerland (AARCH64 NVIDIA Grace/H200), the libfabric "patch" routines
corrupt the procedure linkage table (PLT).

To Reproduce

Call MPI_Init. The symptom can be non-obvious and deferred until a later MPI call.

Expected behavior
Program runs to completion.

Output
MPI_Comm_dup fails after MPI_Init. Single-stepping at the assembly level through the dynamic library's
PLT shows the "wrong" instruction, br x15, at the PLT entry point for MPI_Comm_dup.
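
To make the "wrong instruction" check concrete, here is a minimal sketch in C that classifies raw instruction words read from a PLT entry. It assumes the conventional non-BTI AArch64 PLT stub (adrp x16; ldr x17, [x16, #imm]; add x16, x16, #imm; br x17), which may differ from what a given linker emits, and the helper names decode_br_register and looks_like_plt_stub are hypothetical, not part of libfabric or Cray MPICH.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns N if insn encodes `br xN` (A64 BR is 0xD61F0000 | N << 5), else -1. */
int decode_br_register(uint32_t insn)
{
    if ((insn & 0xFFFFFC1Fu) != 0xD61F0000u)
        return -1;
    return (int)((insn >> 5) & 0x1Fu);
}

/* Returns 1 if the four words match the assumed conventional stub:
 *   adrp x16, page ; ldr x17, [x16, #imm] ; add x16, x16, #imm ; br x17
 * Immediate fields are masked out; only opcodes and registers are compared. */
int looks_like_plt_stub(const uint32_t words[4])
{
    assert(words != NULL);
    int adrp_x16 = (words[0] & 0x9F00001Fu) == 0x90000010u;
    int ldr_x17  = (words[1] & 0xFFC003FFu) == 0xF9400211u;
    int add_x16  = (words[2] & 0xFFC003FFu) == 0x91000210u;
    int br_x17   = decode_br_register(words[3]) == 17;
    return adrp_x16 && ldr_x17 && add_x16 && br_x17;
}
```

The four words can be dumped in a debugger at the PLT entry address for MPI_Comm_dup before and after MPI_Init and passed to looks_like_plt_stub. A word for which decode_br_register returns 15 corresponds to the br x15 seen here.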

Putting a watchpoint on the address of the PLT entry and running the call to MPI_Init traps the PLT being
corrupted, with the call stack I will embed below.

Environment:
Linux AARCH64 NVIDIA Grace/H200 superchip node with HPE Slingshot.

Additional context
