
Workers killed by signal 6/9 and timeout errors during cluster shutdown with LocalCUDACluster #9100

@leekaimao

Description

I'm using dask-cuda's LocalCUDACluster for GPU-based distributed computing in a Python script. While the computation completes successfully, I encounter multiple errors during the shutdown phase.

Specifically, after calling cluster.close() to shut down the Dask cluster gracefully, I see repeated log messages like:

distributed.nanny - INFO - Worker process XXX was killed by signal 6
...
distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
...
distributed.nanny - INFO - Worker process XXX was killed by signal 9

Additionally, I get a traceback indicating a TimeoutError during internal cluster state correction:

tornado.application - ERROR - Exception in callback ...
TimeoutError

And finally, a memory-related error from tcmalloc:

src/tcmalloc.cc:284] Attempt to free invalid pointer 0x...

Environment Setup:

  • Using LocalCUDACluster with explicit GPU device configuration.
  • Disabled Dask optimizations (optimization.fuse.active=False) and set conservative memory thresholds.
  • Workers are configured with device_memory_limit="80GB" and threads_per_worker=1.
  • Client and cluster are manually closed at the end of execution.

Code Snippet:

import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# cuda_devices and n_workers are defined earlier in the script.
dask.config.set({"optimization.fuse.active": False})
dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
    "distributed.worker.memory.terminate": 0.9,
    "distributed.comm.timeouts.connect": "300s",
    "distributed.comm.timeouts.tcp": "300s",
    "distributed.worker.daemon": False,
    "distributed.nanny.timeout": "60s"
})

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=':0',
    jit_unspill=False,
    silence_logs=False
)

client = Client(cluster, timeout='60s')
client.wait_for_workers(n_workers, timeout=120)

# ... computation ...

# Teardown: close the client, then the cluster.
client.close()
cluster.close(timeout=300)
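
For comparison, one variant I am considering is letting context managers handle teardown, so the client is closed before the cluster on exit. A minimal sketch, where cuda_devices, n_workers, and run_computation are placeholders standing in for my actual setup:

import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Placeholders for illustration only; my real values come from the script's config.
cuda_devices = "0,1"
n_workers = 2

def run_computation(client):
    # Placeholder for the actual GPU workload.
    pass

# On exiting the with-block, the client is closed first, then the cluster.
with LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=":0",
    jit_unspill=False,
    silence_logs=False,
) as cluster, Client(cluster, timeout="60s") as client:
    client.wait_for_workers(n_workers, timeout=120)
    run_computation(client)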

Expected Behavior:

Graceful shutdown of workers and scheduler without force-killing or timeout errors.

Actual Behavior:

Workers are terminated forcefully with signal 6 (SIGABRT) and signal 9 (SIGKILL), followed by timeout and memory-related errors during shutdown.

Environment:

  • Dask version: 2024.12.1
  • Dask-CUDA version: 25.2.0
  • Python version: 3.12
  • OS: Linux (assumed)
  • Relevant packages: cudf, cupy, torch, distributed, etc.

Question:

Is this expected behavior? Are there additional configurations or best practices to ensure clean shutdown of GPU clusters in Dask?
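
Another ordering I am considering, but have not verified on my setup, is to retire the workers explicitly before closing anything, so GPU memory is released while the scheduler is still up. A rough sketch, reusing the client and cluster objects from the snippet above:

# Retire all workers gracefully before tearing anything down.
workers = list(client.scheduler_info()["workers"])
client.retire_workers(workers=workers)

# Then close the client, followed by the cluster.
client.close()
cluster.close(timeout=300)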

Any help or guidance would be greatly appreciated!
