I (and others running large jobs on Perlmutter) consistently see our jobs die at random times with the same error message:
MPICH ERROR [Rank 11122] [job id 41595204.0] [Thu Aug 14 08:27:11 2025] [nid006700] - Abort(2695311) (rank 11122 in comm 0): Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(534)..................: MPI_Allreduce(sbuf=0x7ffd73e91548, rbuf=0x7ffd73e91408, count=1, datatype=MPI_DOUBLE_PRECISION, op=MPI_SUM, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(519)..................:
MPIR_CRAY_Allreduce(588).............:
MPIDI_Cray_shared_mem_coll_bcast(494):
MPIDI_Progress_test(97)..............:
MPIDI_OFI_handle_cq_error(1075)......: OFI poll failed (ofi_events.c:1077:MPIDI_OFI_handle_cq_error:Input/output error - CANCELED)
I was wondering if anyone has thoughts on what could be causing this.
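For reference, the call that is in flight when the failure hits is a count=1 MPI_SUM Allreduce over MPI_COMM_WORLD (the MPI_DOUBLE_PRECISION datatype in the stack suggests the Fortran bindings are in use). Below is a minimal C sketch of that call pattern, only to illustrate the kind of collective involved; the loop and payload are placeholders, not our actual application code.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Repeatedly reduce a single double across all ranks, mirroring the
     * count=1 MPI_SUM Allreduce on MPI_COMM_WORLD shown in the error stack.
     * The payload and iteration count are placeholders. */
    for (long iter = 0; iter < 1000000; ++iter) {
        double local = (double)rank;
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0)
        printf("completed all iterations\n");

    MPI_Finalize();
    return 0;
}

On Perlmutter this would be built with the cc compiler wrapper and launched with srun across many nodes; the real jobs that die run far larger rank counts (the log above shows rank 11122) for many hours before the OFI error appears.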