zombie clusters #914
Comments
Can anyone help me with this or tell me how to completely kill these zombie clusters?
Small note: I was using
Like - this cluster
There is a lingering scheduler pod, so they are consuming some resources:
Just to confirm, does doing this do anything?
There is an idle timeout of 1800 seconds, but that's apparently not kicking in.
It does not work - interestingly though, I can do
and it will scale, which is the very confusing thing! And that cluster has been around for way longer than 1800 seconds - more like 2 days! Thanks so much Tom - nice to see you here :)
Weird... So you can connect to the cluster? What if you do...
In theory that's supposed to tell the scheduler and workers to shut down too.
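Roughly, the two calls being discussed look like this with the standard dask-gateway and distributed client APIs (a sketch; the exact snippets from the thread are not preserved, so treat the specifics as assumptions):

```python
from dask_gateway import Gateway

g = Gateway()
# attach to the lingering (zombie) cluster by name
cluster = g.connect(g.list_clusters()[0].name)

# scaling still responds, even on the zombie cluster
cluster.scale(4)

# in theory this asks the scheduler and workers to shut down as well
client = cluster.get_client()
client.shutdown()
```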
So that's 1800 seconds of "inactivity". I was poking around to see what that means: https://github.com/dask/distributed/blob/08ea96890674d48b90f4e1f92959957e5e362a18/distributed/scheduler.py#L6362-L6379. Basically, it checks whether any of the workers have tasks they're working on. It'd be useful to surface this through the dashboard, but I don't think it is.

It is independent of when you log out / close your server. Normally, the lifetime of the Dask cluster is tied to the lifetime of your Python kernel. We register some functions to run when the kernel is shut down, telling the cluster to shut down too. But if the kernel doesn't exit cleanly then those functions may not run. That's what the idle timeout is for, but it's apparently not working in this case.

If you're able to, it'd be interesting to check the state of the scheduler. Maybe something like
and compare that to
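A sketch of one way to check this from the client, assuming the cluster is reachable and that the scheduler object exposes `idle_timeout` and per-worker `processing` state as in the code linked above:

```python
from dask_gateway import Gateway

g = Gateway()
client = g.connect(g.list_clusters()[0].name).get_client()

def idle_state(dask_scheduler):
    # the idle check only closes the scheduler once no worker has tasks assigned,
    # so surface that state along with the configured timeout
    return {
        "idle_timeout": dask_scheduler.idle_timeout,
        "processing_per_worker": {
            addr: len(ws.processing) for addr, ws in dask_scheduler.workers.items()
        },
    }

print(client.run_on_scheduler(idle_state))
```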
I'm always following along :)
Ok - lol - it is still there!
gave me which sounds about right 🤣
It's hanging... I will report back if it gets somewhere. What is the difference between
and
I think both connect the cluster to my notebook, but in a different way.
I get this warning - it did not happen on Friday
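Possibly the two variants in question (a guess on the exact pair; both are standard dask-gateway / distributed usage):

```python
from dask.distributed import Client
from dask_gateway import Gateway

g = Gateway()
cluster = g.connect(g.list_clusters()[0].name)

client_a = cluster.get_client()  # let the GatewayCluster build the client
client_b = Client(cluster)       # or hand the cluster object to distributed.Client
```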
Those should be about the same. I think there's an issue somewhere about standardizing them (dask-gateway used to require one of them). I think the warning is safe to ignore... It's only happening right now because the zombie cluster was created a while ago and we updated the image in the meantime. Your client is on the new image and your zombie cluster is on the old one.
It's been 24 min and it's still hanging.
Still hanging after 2 hours - I would say that client.shutdown() doesn't work. Feel free to kill it!
OK, I killed it.
Not sure if this is still coming up, but I've noticed on the AWS hub in a clean session I often see a pending cluster still around:
[ClusterReport<name=icesat2-staging.45dc9798954c48fca2b3580b6a6104ea, status=PENDING>]
The following command gets rid of it:
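A likely candidate, using the public dask-gateway API (whether `stop_cluster` was the exact call used here is an assumption):

```python
from dask_gateway import Gateway

g = Gateway()
for report in g.list_clusters():
    # tear down any lingering / pending cluster by name
    g.stop_cluster(report.name)
```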
Thanks @scottyhq !!! Next time I will try this. If it consistently works we should add it to the notes. I am keeping the issue open because I plan to add some text about this issue in the documentation... at some point!
That command killed a zombie cluster right away! Great!
I have noticed for a while now that when my computations don't work (e.g. a worker dies), oftentimes, regardless of my best efforts to kill a cluster, I keep seeing zombie clusters on my machine.
For example, I run:
cluster.close()
and the cluster is still there.
Usually I try to scale it down to 0, so at least I am not using anything, but the cluster stays there.
Today I kept having issues with my clusters - probably due to what I want to do - and I had 4 zombie clusters (that I managed to scale down to 0 by connecting to each of them through
cluster = g.connect(g.list_clusters()[i].name)
), so I decided to restart my server entirely. I went to my home page, pressed stop server, and restarted it.
And on the new server I could still list the zombie clusters with
g.list_clusters()
They all have 0 workers and cores, but they are there, and I think they can still take up memory just by existing.
After a while - I guess after whatever timeout limit is in place - they disappear.
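A rough sketch of the workaround described above, built from the same `Gateway` calls (scaling to 0 so the zombies at least stop holding workers):

```python
from dask_gateway import Gateway

g = Gateway()
print(g.list_clusters())  # zombie clusters still show up here, even after a server restart

for report in g.list_clusters():
    cluster = g.connect(report.name)
    cluster.scale(0)  # release the workers; the scheduler pod itself still lingers
```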