Deadlock with MPIManager and custom worker pool #32

Open
@marius311

I'm trying to use a custom worker pool (with the goal of using the master process as a worker too, so as not to waste a GPU), but I'm getting a deadlock in this package. Unfortunately I can't produce an MWE; schematically, my code looks something like this (although this by itself doesn't seem to trigger the deadlock):

# myscript.jl
using MPI, MPIClusterManagers, Distributed

MPI.Init()
MPIClusterManagers.start_main_loop(MPI_TRANSPORT_ALL) # or TCP_TRANSPORT_ALL, doesn't matter

pool = WorkerPool(procs())

for i = 1:100
    pmap(pool, 1:10) do j
        # ...
    end
end

and then run mpiexec -n 4 julia myscript.jl. In my real case, it always deadlocks somewhere before i = 100. Interrupting a worker, I can retrieve this stack trace:

signal (15): Terminated
in expression starting at /global/u1/m/marius/work/pipelineB2/scripts/bk18_fwdsim2_nodust.jl:68
jl_pgcstack_addr_static at /buildworker/worker/package_linux64/build/cli/loader_exe.c:14
ctx_switch at /buildworker/worker/package_linux64/build/src/task.c:398
jl_switch at /buildworker/worker/package_linux64/build/src/task.c:502
try_yieldto at ./task.jl:767
wait at ./task.jl:837
yield at ./task.jl:721
receive_event_loop at /global/homes/m/marius/.julia/packages/MPIClusterManagers/TTxqG/src/mpimanager.jl:430
#20 at ./task.jl:423
unknown function (ip: 0x7ff690147f7f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:877
unknown function (ip: (nil))
Allocations: 766929758 (Pool: 766247147; Big: 682611); GC: 1133

Any ideas what could be going on? Versions: Julia v1.7.2, MPI v0.19.2, MPIClusterManagers v0.2.1, OpenMPI v4.0.5.
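For context, here is a minimal sketch of what the custom pool is meant to do. It assumes plain local workers started with addprocs rather than MPI ranks, so it does not involve MPIClusterManagers and is not expected to reproduce the deadlock. The names custom_pool and pids are only illustrative. It shows that a WorkerPool built from procs() makes the master process (id 1) eligible to run pmap tasks, which the default pool of workers() does not.

```julia
using Distributed

# Assumption for illustration: two local worker processes stand in for MPI ranks.
addprocs(2)

# procs() includes the master (id 1); workers() would exclude it.
custom_pool = WorkerPool(procs())

# Record which process handled each item.
pids = pmap(custom_pool, 1:20) do j
    sleep(0.05)   # stand-in for real work
    myid()
end

println("processes in pool:        ", sort(procs()))
println("processes that ran tasks: ", sort(unique(pids)))
```

Which processes actually pick up tasks depends on scheduling, so the second line can vary between runs.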
