Problem: PMI port allocation is likely to result in occasional failures
If the ports allocated to Cray PMI by flux are not bindable, the failure will look similar to this:
61.065s: job.exception fA8LTxDqR type=exec severity=0 hello: rank 2 on host elcap4367 exited and exit-timeout=30s has expired
flux-job: task(s) Unknown signal 127
Fri Mar 14 11:56:06 2025: [PE_2]:inet_listen_socket_setup:bind() failed [fd=3, port=1371 err='Address already in use']
Fri Mar 14 11:56:06 2025: [PE_2]:_pmi_inet_listen_socket_setup:socket setup failed
Fri Mar 14 11:56:06 2025: [PE_2]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar 14 11:56:06 2025] [elcap4367] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(455).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(455).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
Fri Mar 14 11:56:06 2025: [PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=1371 err='Address already in use']
Fri Mar 14 11:56:06 2025: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
Fri Mar 14 11:56:06 2025: [PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar 14 11:56:06 2025] [elcap4366] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(455).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170):
MPID_Init(455).......:
MPIR_pmi_init(110)...: PMI_Init returned 1
61.064s: flux-shell[0]: FATAL: doom: hello: rank 2 on host elcap4367 exited and exit-timeout=30s has expired
This can be simulated by running multi-node jobs that share nodes without the port allocator jobtap module loaded. In that case PMI_CONTROL_PORT is not set in Cray PMI's environment, and all jobs use the same default control port (apparently 1371).
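For readers who want to see the underlying socket error without a cluster, here is a minimal sketch of the same collision in plain Python. It is not Cray PMI or flux code; `try_bind` is a hypothetical helper, and it uses a kernel-chosen port on loopback rather than 1371 so it does not disturb anything real.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind and listen on (host, port).

    Returns the listening socket, or None if the address is already in use.
    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


if __name__ == "__main__":
    # "Job 1" grabs a control port (port 0 asks the kernel for any free port).
    first = try_bind(0)
    port = first.getsockname()[1]

    # "Job 2" on the same node is handed the same port number.
    second = try_bind(port)
    if second is None:
        print(f"bind() failed on port {port}: address already in use")
    else:
        second.close()
    first.close()
```

The second bind fails for the same reason as the `bind() failed [... err='Address already in use']` lines in the log above: two listeners on one node were given the same port.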
I wanted to get an issue open so that the error users would see is searchable here.
Since the port allocator a) is not "global" to the node, and b) assumes a range of ports is free without checking, it should be possible to hit this even on systems deployed with node-exclusive scheduling at the system level.
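To illustrate point b), this is a rough sketch of what "checking" a candidate range could look like on a node. It is only an assumption about one possible approach, not how the flux port allocator works; `bindable_ports` and its parameters are hypothetical names.

```python
import socket


def bindable_ports(start, count, host=""):
    """Return the ports in [start, start + count) that can be bound right now.

    Hypothetical helper: each port is probed with a throwaway TCP socket
    that is closed again immediately.
    """
    free = []
    for port in range(start, start + count):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # In use, privileged, or otherwise not bindable: skip it.
                continue
            free.append(port)
    return free


if __name__ == "__main__":
    # Example range only; the real allocator's range is not assumed here.
    print(bindable_ports(11000, 4))
```

A probe like this is inherently racy: a port that is bindable at check time can be taken by another process before PMI binds it, so it would narrow the failure window rather than close it.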