
[PE_0]: inet_listen_socket_setup:bind() failed [fd=3, port=1371 err='Address already in use'] #323

@garlick

Problem: PMI port allocation is likely to result in occasional failures

If the ports allocated to Cray PMI by flux are not bindable, the failure will look similar to this:

61.065s: job.exception fA8LTxDqR type=exec severity=0 hello: rank 2 on host elcap4367 exited and exit-timeout=30s has expired
flux-job: task(s) Unknown signal 127
Fri Mar 14 11:56:06 2025: [PE_2]:inet_listen_socket_setup:bind() failed [fd=3, port=1371 err='Address already in use']
Fri Mar 14 11:56:06 2025: [PE_2]:_pmi_inet_listen_socket_setup:socket setup failed
Fri Mar 14 11:56:06 2025: [PE_2]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar 14 11:56:06 2025] [elcap4367] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(455).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(455).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1
Fri Mar 14 11:56:06 2025: [PE_0]:inet_listen_socket_setup:bind() failed [fd=3, port=1371 err='Address already in use']
Fri Mar 14 11:56:06 2025: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
Fri Mar 14 11:56:06 2025: [PE_0]:control_nets_listen:_pmi_inet_listen_socket_setup (full) returned -1
MPICH ERROR [Rank 0] [job id unknown] [Fri Mar 14 11:56:06 2025] [elcap4366] - Abort(1092879) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(455).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170): 
MPID_Init(455).......: 
MPIR_pmi_init(110)...: PMI_Init returned 1
61.064s: flux-shell[0]: FATAL: doom: hello: rank 2 on host elcap4367 exited and exit-timeout=30s has expired

This can be simulated by running multi-node jobs that share nodes without the port allocator jobtap module loaded. In that case PMI_CONTROL_PORT is not set in Cray PMI's environment, so all jobs use the same default control port (apparently 1371).
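For anyone who wants to see the underlying socket error without Cray PMI involved, here is a minimal sketch of the same condition: two listeners on one host asking for the same TCP port. The helper name `second_bind_errno` is made up for illustration, and the sketch uses a kernel-chosen port rather than 1371. It only shows the generic bind() behavior and is not a model of what Cray PMI does internally.

```python
import errno
import socket


def second_bind_errno():
    """Bind two TCP sockets to the same port and return the errno
    raised by the second bind(), or None if it unexpectedly succeeds."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 lets the kernel pick a free port for the first listener.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port, no SO_REUSEADDR/SO_REUSEPORT set.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = second_bind_errno()
    print(errno.errorcode.get(err, err))
```

The second bind() fails with EADDRINUSE, which is the errno behind the 'Address already in use' text in the PE_n lines above.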

I wanted to get an issue open so that the error users would see is searchable here.

Since the port allocator a) is not "global" to the node and b) assumes a range of ports is free without checking, it should be possible to hit this even on systems deployed with node-exclusive scheduling at the system level.
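To illustrate point b), a rough sketch of what "checking" a candidate range could look like is below. The function `unbindable_ports` is hypothetical and is not part of flux or the port allocator. It simply tries to bind each port and reports the ones that fail. A check like this is inherently racy, since a port that is free at check time can be taken before the job's PMI tries to bind it, so it would narrow the window rather than close it.

```python
import socket


def unbindable_ports(ports, host=""):
    """Return the subset of `ports` that cannot be bound on `host` right now.

    Hypothetical helper for illustration only. The result is a snapshot:
    another process may take a port immediately after this check.
    """
    busy = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                # Typically EADDRINUSE, or EACCES for privileged ports.
                busy.append(port)
    return busy


if __name__ == "__main__":
    # Example range chosen arbitrarily for the demo.
    print(unbindable_ports(range(1371, 1376)))
```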
