I (and others running large jobs on Perlmutter) consistently see our jobs die at random times with the same error message. #11307

@jjepson2

Description

I (and others running large jobs on Perlmutter) consistently see our jobs die at random times with the same error message:

```
MPICH ERROR [Rank 11122] [job id 41595204.0] [Thu Aug 14 08:27:11 2025] [nid006700] - Abort(2695311) (rank 11122 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(534)..................: MPI_Allreduce(sbuf=0x7ffd73e91548, rbuf=0x7ffd73e91408, count=1, datatype=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(519)..................:
MPIR_CRAY_Allreduce(588).............:
MPIDI_Cray_shared_mem_coll_bcast(494):
MPIDI_Progress_test(97)..............:
MPIDI_OFI_handle_cq_error(1075)......: OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)
```

I was wondering if there were any thoughts as to what could be causing this.
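
For context, the failing call has the shape shown in the error stack (count=1, MPI_SUM over MPI_COMM_WORLD). Below is a minimal hypothetical sketch of that pattern in C, not our actual application code; the MPI_DOUBLE_PRECISION datatype in the trace suggests the real code is Fortran, and the loop count is arbitrary, just there to give a long-running job many chances to hit a transient fabric error:

```c
/* Hypothetical minimal sketch of the call pattern in the trace above.
 * Build with the Cray compiler wrapper (e.g. `cc reproducer.c`) and
 * launch with srun across many nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One double per rank, summed across MPI_COMM_WORLD -- the same
     * shape as the failing call in the error stack. */
    double local = (double)rank;
    double global = 0.0;

    /* Repeat the reduction for a while; the aborts we see happen at
     * random times well into the run, not on the first collective. */
    for (long i = 0; i < 1000000; i++) {
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```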
