Description
I'm using dask-cuda's LocalCUDACluster for GPU-based distributed computing in a Python script. While the computation completes successfully, I encounter multiple errors during the shutdown phase.
Specifically, after calling cluster.close() and attempting to gracefully shut down the Dask cluster, I see repeated logs like:
distributed.nanny - INFO - Worker process XXX was killed by signal 6
...
distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
...
distributed.nanny - INFO - Worker process XXX was killed by signal 9
Additionally, I get a traceback indicating a TimeoutError during the cluster's internal state correction:
tornado.application - ERROR - Exception in callback ...
TimeoutError
And finally, a memory-related error from tcmalloc:
src/tcmalloc.cc:284] Attempt to free invalid pointer 0x...
Environment Setup:
- Using LocalCUDACluster with explicit GPU device configuration.
- Disabled Dask optimizations (optimization.fuse.active=False) and set conservative memory thresholds.
- Workers are configured with device_memory_limit="80GB" and threads_per_worker=1.
- Client and cluster are manually closed at the end of execution.
Code Snippet:
import dask
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

dask.config.set({"optimization.fuse.active": False})
dask.config.set({
"distributed.worker.memory.target": 0.6,
"distributed.worker.memory.spill": 0.7,
"distributed.worker.memory.pause": 0.8,
"distributed.worker.memory.terminate": 0.9,
"distributed.comm.timeouts.connect": "300s",
"distributed.comm.timeouts.tcp": "300s",
"distributed.worker.daemon": False,
"distributed.nanny.timeout": "60s"
})
cluster = LocalCUDACluster(
CUDA_VISIBLE_DEVICES=cuda_devices,
device_memory_limit="80GB",
n_workers=n_workers,
threads_per_worker=1,
dashboard_address=':0',
jit_unspill=False,
silence_logs=False
)
client = Client(cluster, timeout='60s')
client.wait_for_workers(n_workers, timeout=120)
# ... computation ...
client.close(timeout=60)  # client and cluster are both closed manually
cluster.close(timeout=300)
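For reference, here is a teardown variant I'm considering (a sketch only, not the code that produced the logs above; it assumes that closing the client before the cluster and relying on the context managers of LocalCUDACluster and Client is sufficient):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

with LocalCUDACluster(
    CUDA_VISIBLE_DEVICES=cuda_devices,
    device_memory_limit="80GB",
    n_workers=n_workers,
    threads_per_worker=1,
    dashboard_address=':0',
) as cluster, Client(cluster, timeout='60s') as client:
    client.wait_for_workers(n_workers, timeout=120)
    # ... computation ...
    client.close(timeout=60)  # close the client before the cluster shuts down
# exiting the with-block closes the cluster after the client is gone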
Expected Behavior:
Graceful shutdown of workers and scheduler without force-killing or timeout errors.
Actual Behavior:
Workers are terminated forcefully with signals 6 and 9, followed by timeout and memory-related errors during shutdown.
Environment:
- Dask version: 2024.12.1
- Dask-CUDA version: 25.2.0
- Python version: 3.12
- OS: Linux (assumed)
- Relevant packages: cudf, cupy, torch, distributed, etc.
Question:
Is this expected behavior? Are there additional configurations or best practices to ensure clean shutdown of GPU clusters in Dask?
Any help or guidance would be greatly appreciated!