
Workers killed by signal 6/9 and timeout errors during cluster shutdown with LocalCUDACluster #9100

@leekaimao

Description

I'm using dask-cuda's LocalCUDACluster for GPU-based distributed computing in a Python script. While the computation completes successfully, I encounter multiple errors during the shutdown phase.

Specifically, after calling cluster.close() to shut down the Dask cluster gracefully, I see repeated log messages like:

distributed.nanny - INFO - Worker process XXX was killed by signal 6
...
distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
...
distributed.nanny - INFO - Worker process XXX was killed by signal 9

Additionally, I get a traceback indicating a TimeoutError during internal cluster state correction:

tornado.application - ERROR - Exception in callback ...
TimeoutError

And finally, a memory-related error from tcmalloc:

src/tcmalloc.cc:284] Attempt to free invalid pointer 0x...

Environment Setup:

  • Using LocalCUDACluster with explicit GPU device configuration.
  • Disabled Dask optimizations (optimization.fuse.active=False) and set conservative memory thresholds.
  • Workers are configured with device_memory_limit="80GB" and threads_per_worker=1.
  • Client and cluster are manually closed at the end of execution.

Code Snippet:

import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# cuda_devices and n_workers are defined earlier in the script.
dask.config.set({"optimization.fuse.active": False})
dask.config.set({
    "distributed.worker.memory.target": 0.6,
    "distributed.worker.memory.spill": 0.7,
    "distributed.worker.memory.pause": 0.8,
    "distributed.worker.memory.terminate": 0.9,
    "distributed.comm.timeouts.connect": "300s",
    "distributed.comm.timeouts.tcp": "300s",
    "distributed.worker.daemon": False,
    "distributed.nanny.timeout": "60s"
})

cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=':0',
    jit_unspill=False,
    silence_logs=False
)

client = Client(cluster, timeout='60s')
client.wait_for_workers(n_workers, timeout=120)

# ... computation ...

# Teardown: close the client, then the cluster.
client.close()
cluster.close(timeout=300)
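
For comparison, one variant I am considering is letting context managers handle teardown, so the client is closed before the cluster on exit. A minimal sketch, where cuda_devices, n_workers, and run_computation are placeholders standing in for my actual setup:

import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Placeholders for illustration only; my real values come from the script's config.
cuda_devices = "0,1"
n_workers = 2

def run_computation(client):
    # Placeholder for the actual GPU workload.
    pass

# On exiting the with-block, the client is closed first, then the cluster.
with LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=":0",
    jit_unspill=False,
    silence_logs=False,
) as cluster, Client(cluster, timeout="60s") as client:
    client.wait_for_workers(n_workers, timeout=120)
    run_computation(client)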

Expected Behavior:

Graceful shutdown of workers and scheduler without force-killing or timeout errors.

Actual Behavior:

Workers are terminated forcefully with signal 6 (SIGABRT) and signal 9 (SIGKILL), followed by timeout and memory-related errors during shutdown.

Environment:

  • Dask version: 2024.12.1
  • Dask-CUDA version: 25.2.0
  • Python version: 3.12
  • OS: Linux (assumed)
  • Relevant packages: cudf, cupy, torch, distributed, etc.

Question:

Is this expected behavior? Are there additional configurations or best practices to ensure clean shutdown of GPU clusters in Dask?
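
Another ordering I am considering, but have not verified on my setup, is to retire the workers explicitly before closing anything, so GPU memory is released while the scheduler is still up. A rough sketch, reusing the client and cluster objects from the snippet above:

# Retire all workers gracefully before tearing anything down.
workers = list(client.scheduler_info()["workers"])
client.retire_workers(workers=workers)

# Then close the client, followed by the cluster.
client.close()
cluster.close(timeout=300)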

Any help or guidance would be greatly appreciated!
