I tried to use MPI.jl to connect different compute nodes, and I found that there is no option for specifying the hosts. With plain MPICH we can specify remote hosts via `mpiexec -hosts localhost,node1,node2 ...`. The default value of the `mpirun_cmd` parameter in `MPIManager` is `mpiexec -np $np`, so I tried `mpiexec -np 2 -hosts node2,node3` and got the error below.
```julia
julia> using MPI

julia> manager = MPIManager(np = 2, mpirun_cmd = `mpiexec -np 2 -hosts node2,node3`)
MPI.MPIManager(np=2,launched=false,mode=MPI_ON_WORKERS)

julia> addprocs(manager)
ERROR: connect: host is unreachable (EHOSTUNREACH)
Stacktrace:
 [1] try_yieldto(::Base.##296#297{Task}, ::Task) at ./event.jl:189
 [2] wait() at ./event.jl:234
 [3] wait(::Condition) at ./event.jl:27
 [4] stream_wait(::TCPSocket, ::Condition, ::Vararg{Condition,N} where N) at ./stream.jl:42
 [5] wait_connected(::TCPSocket) at ./stream.jl:258
 [6] connect at ./stream.jl:983 [inlined]
 [7] connect(::IPv4, ::Int64) at ./socket.jl:738
 [8] setup_worker(::Int64, ::Int64, ::Symbol) at /home/alkorang/.julia/v0.6/MPI/src/cman.jl:197
[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@node2] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@node2] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@node2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@node2] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Error in MPI launch ErrorException("Timeout -- the workers did not connect to the manager")
ERROR (unhandled task failure): Timeout -- the workers did not connect to the manager
ERROR: Timeout -- the workers did not connect to the manager
Stacktrace:
 [1] wait(::Task) at ./task.jl:184
 [2] #addprocs_locked#30(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:361
 [3] #addprocs#29(::Array{Any,1}, ::Function, ::MPI.MPIManager) at ./distributed/cluster.jl:319
 [4] addprocs(::MPI.MPIManager) at ./distributed/cluster.jl:315
```
I executed this code on node2, and there is nothing wrong with the firewall settings or paths, because an MPI program written in C runs without any error.
I used CentOS 7, MPICH 3, Julia 0.6.0, and the same path on both nodes.
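For reference, here is a sketch of the two usual ways MPICH's `mpiexec` accepts host lists: an inline `-hosts` list and a hostfile passed with `-f`. The node names (`node2`, `node3`) are the placeholders from this report, and `./my_mpi_program` is a hypothetical binary; the actual launch lines are left commented out since they require a working cluster.

```shell
#!/bin/sh
# Sketch only: how MPICH's mpiexec takes host lists (names are placeholders).

# 1) Inline host list:
#    mpiexec -np 2 -hosts node2,node3 ./my_mpi_program

# 2) Hostfile, one host per line:
cat > hostfile <<'EOF'
node2
node3
EOF
#    mpiexec -np 2 -f hostfile ./my_mpi_program

# Show what was written, for sanity:
echo "hostfile contains $(wc -l < hostfile) hosts"
```

Either form would presumably need to be mirrored in `mpirun_cmd` for `MPIManager` to launch workers on the right machines, which is what the session above attempts.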