Open
Description
I'm trying to use a custom worker pool (with the goal of using the master process as a worker too, so as not to waste a GPU), but I'm getting a deadlock in this package. Unfortunately I can't produce an MWE, but schematically my code looks something like this (although this snippet by itself doesn't seem to trigger the deadlock):
# myscript.jl
using MPI, MPIClusterManagers, Distributed
MPI.Init()
MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) # or TCP_TRANSPORT_ALL, doesn't matter
pool = WorkerPool(procs())
for i = 1:100
    pmap(pool, 1:10) do j
        # ...
    end
end
and then mpiexec -n 4 julia myscript.jl. In my real case, I always deadlock somewhere before i = 100. Interrupting a worker, I can retrieve this stack trace:
signal (15): Terminated
in expression starting at /global/u1/m/marius/work/pipelineB2/scripts/bk18_fwdsim2_nodust.jl:68
jl_pgcstack_addr_static at /buildworker/worker/package_linux64/build/cli/loader_exe.c:14
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:398
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
yield at ./task.jl:721
receive_event_loop at /global/homes/m/marius/.julia/packages/MPIClusterManagers/TTxqG/src/mpimanager.jl:430
#20 at ./task.jl:423
unknown function (ip: 0x7ff690147f7f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
unknown function (ip: (nil))
Allocations: 766929758 (Pool: 766247147; Big: 682611); GC: 1133
Any ideas what could be going on? Julia v1.7.2, MPI v0.19.2, MPIClusterManagers v0.2.1, OpenMPI v4.0.5.
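For reference, the worker-pool part on its own can be sketched without MPI roughly as follows. This is only an illustration of the intent (the master process takes work alongside the others), not the code that deadlocks. It assumes two local workers started with addprocs stand in for the MPI ranks, and the sleep call is a placeholder for the real work.

```julia
using Distributed

# Assumption: two local workers stand in for the MPI ranks of the real script.
addprocs(2)

# procs() includes the master (pid 1); workers() excludes it once workers exist,
# so a pool built from procs() lets the master take work as well.
pool = WorkerPool(procs())

# Record which process handled each item.
pids = pmap(pool, 1:20) do j
    sleep(0.1)  # placeholder for the real computation
    myid()
end

println("processes in pool:  ", procs())
println("processes that ran: ", sort(unique(pids)))
```

If pid 1 shows up in the second line, the master is taking items from the pool as intended.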