Description
I have noticed for a while now that when my computations fail (e.g., a worker dies), zombie clusters often linger on my machine, no matter how hard I try to kill them.
For example:
- I call `cluster.close()`
- I restart the kernel
- I run in another notebook:
  ```python
  from dask_gateway import Gateway

  g = Gateway()
  g.list_clusters()
  ```
  and the cluster is still there.
Usually I scale it down to 0 workers, so that at least it isn't using any resources, but the cluster itself stays there.
Today I kept having issues with my clusters (probably because of what I am trying to do) and ended up with 4 zombie clusters. I managed to scale each one down to 0 by connecting to it with `cluster = g.connect(g.list_clusters()[i].name)`, and then I decided to restart my server entirely.
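For what it's worth, a loop like the one below might tear the zombies down outright instead of just scaling them to 0. This is a minimal sketch assuming the client's `Gateway.stop_cluster()` call actually shuts a lingering cluster down; I have not verified that it behaves any differently from `cluster.close()` in this situation:

```python
from dask_gateway import Gateway

g = Gateway()

# Ask the gateway to stop every cluster it still knows about,
# rather than connecting to each one and scaling it to 0.
for report in g.list_clusters():
    g.stop_cluster(report.name)
```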
I went to my home page, pressed "Stop server", and restarted it.
And on the new server I could still list the zombie clusters with `g.list_clusters()`.
They all have 0 workers and 0 cores, but they are still there, and I suspect they can still consume memory just by existing.
After a while (I guess after whatever timeout limit is in place) they disappear.
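If that eventual cleanup comes from the gateway's idle timeout, the server-side setting involved may be `ClusterConfig.idle_timeout`. The snippet below is a guess at how an administrator would tighten it in the gateway's config file; the file name and whether this timeout is what reaps these zombies are assumptions on my part:

```python
# dask_gateway_config.py (server side, hypothetical file name)
# Assumption: idle_timeout is what reaps clusters with no active clients.
c.ClusterConfig.idle_timeout = 3600  # seconds of idleness before shutdown
```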