
MPI remote machine connection #3

Open

Description

@alkorang

I tried to use MPI.jl to connect different computing nodes, and I found that there is no option for specifying the hosts.
With plain MPICH we can specify remote hosts with mpiexec -hosts localhost,node1,node2 ...
The default value of the mpirun_cmd parameter of MPIManager is mpiexec -np $np, so I tried mpiexec -np 2 -hosts node2,node3 and got the error below (a hostfile variant of the same launch is sketched after the trace).

julia> using MPI

julia> manager = MPIManager(np = 2, mpirun_cmd =
       `mpiexec -np 2 -hosts node2,node3`)
MPI.MPIManager(np=2,launched=false,mode=MPI_ON_WORKERS)

julia> addprocs(manager)
ERROR: connect: host is unreachable (EHOSTUNREACH)
Stacktrace:
 [1] try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
 [2] wait() at ./event.jl:234
 [3] wait(::Condition) at ./event.jl:27
 [4] stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
 [5] wait_connected(::TCPSocket) at ./stream.jl:258
 [6] connect at ./stream.jl:983 [inlined]
 [7] connect(::IPv4, ::Int64) at ./socket.jl:738
 [8] setup_worker(::Int64, ::Int64, ::Symbol) at /home/alkorang/.julia/v0.6/MPI/src/cman.jl:197
[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@node2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node2] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Error in MPI launch ErrorException("Timeout -- the workers did not connect to the manager")
ERROR (unhandled task failure): Timeout -- the workers did not connect to the manager
ERROR: Timeout -- the workers did not connect to the manager
Stacktrace:
 [1] wait(::Task) at ./task.jl:184
 [2] #addprocs_locked#30(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:361
 [3] #addprocs#29(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:319
 [4] addprocs(::MPI.MPIManager) at ./distributed/cluster.jl:315
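
For reference, MPICH's mpiexec also accepts a hostfile via the -f flag instead of -hosts. A minimal sketch of the equivalent launch, assuming a file named hostfile that lists node2 and node3 one per line (the file name and path are illustrative):

# Sketch only: MPICH's -f flag reads the hosts from a file instead of -hosts.
# "hostfile" is an illustrative path containing node2 and node3, one per line.
using MPI
manager = MPIManager(np = 2,
                     mpirun_cmd = `mpiexec -np 2 -f hostfile`)
addprocs(manager)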

I executed the failing code above on node2, and there should be nothing wrong with the firewall settings or the paths, because an MPI program written in C runs across the same nodes without any error.

I used CentOS 7, MPICH 3, and Julia 0.6.0, with the same installation paths on both nodes.
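
Note that the failure is EHOSTUNREACH on a plain TCP connect: the stack trace shows setup_worker calling connect(::IPv4, ::Int64), because in MPI_ON_WORKERS mode each worker connects back to the manager over TCP, outside of MPI's own transport. So a C MPI program succeeding does not by itself rule out a TCP routing problem between the nodes. A quick reachability check, independent of MPI (a sketch; port 8000 is arbitrary):

# On node2 (where the manager runs):
server = listen(ip"0.0.0.0", 8000)   # listen on all interfaces
sock = accept(server)                # blocks until a peer connects

# On node3 (a worker node):
sock = connect("node2", 8000)        # EHOSTUNREACH here would point to routing, not MPI.jl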
