Unstable servers #111
To confirm: your server shut down while you were actively using it in the browser, not just while a script ran in a terminal overnight, and shortly afterwards JupyterHub showed you a "start server" button and a choice of server type? Which server type did you choose: a shared 1/16th, a shared 1/4th, or a dedicated server of some size? |
Yes, while running simulations directly from VSCode (i.e. active development and launching simulations). First I see that the VSCode connection has dropped, then I go to the launch pad or to a terminal and it asks to restart the server. I'm using a dedicated "massive" machine. |
@JordiBolibar are you using SSH or VSCode from your local computer, or are you doing everything in the browser at hub.jupytearth.org (where you can have a terminal and use VSCode)? |
I'm running everything directly in the browser (including VSCode). So no SSH. |
@JordiBolibar thanks, investigating further. Roughly when was the last server shutdown event? Knowing that can help me find the relevant entries in a sea of logs. |
This morning, between 9 and 11 am CET, I had 2 shutdowns. |
@JordiBolibar is this the named server called ODINN, or something else? |
Yes, I'm using ODINN and ODINN-2. I think this morning I only used ODINN. |
I observe this in the JupyterHub logs, so it seems that your server is getting culled because it is considered inactive. All times below are in UTC, so one hour behind CET.
In #105 we discussed a workaround. Did these shutdowns happen after applying that workaround or not? |
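To line up the reported 9 to 11 am CET window with UTC log timestamps, a small conversion like the following can help. This is only a sketch: the date is a placeholder (the thread does not state it), the helper name `cet_to_utc` is made up, and a fixed UTC+1 offset is assumed, so it would be off by an hour during summer time (CEST).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset for CET; deliberately ignores summer time (CEST).
CET = timezone(timedelta(hours=1), "CET")

def cet_to_utc(year, month, day, hour, minute=0):
    """Convert a CET wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=CET)
    return local.astimezone(timezone.utc)

if __name__ == "__main__":
    # Placeholder date, not taken from the thread.
    start = cet_to_utc(2022, 1, 10, 9)
    end = cet_to_utc(2022, 1, 10, 11)
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"), "UTC")
```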
Well, that was to leave simulations running overnight. Here I am actively using the session. The thing is that I've had zero issues with the same working routine until some days ago. And BTW, the Jupyter notebook with the sleep command only did the trick sometimes. In the last month it stopped working and my server was getting stopped anyway. |
Hi @consideRatio. Friday everything worked super smoothly, but today things are extremely unstable again. I've had 4 shutdowns already, even when actively coding and launching simulations. The problems also included issues loading VSCode from codeserver. |
Hmmm okay, so it seems you have exceeded the available memory in a way that got you kicked out by k8s, which is quite a harsh stop compared to, for example, having a single process shut down.
The following is also relevant. I can see one culling event due to "inactivity" according to JupyterHub since
Whenever you are culled by JupyterHub, it is because JupyterHub considers you inactive. By using code-server, you may bypass some mechanism that makes JupyterHub think you are active. I want to try to follow that up, but there isn't a quick fix for it besides the workaround ideas suggested in #105. Besides being culled, you can get terminated for using too much memory. That can be done either by Kubernetes, which is a higher-level kind of control, or by Linux itself inside the Docker container you control, which is a lower-level kind of termination. To work effectively without getting blocked by these issues I suggest:
Thank you for reporting these experiences, it helps drive development towards resolving these kinds of issues long term even though it is hard to fix them quickly without upstream changes. |
On the 64 CPU node you are running now, you should have at least 224 GB of memory. I entered the container and observed with
|
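For anyone wanting to keep an eye on memory headroom from inside the container, here is a minimal sketch. It assumes the container exposes the standard cgroup memory files (cgroup v2 or v1 paths below); whether they are present and readable on this hub is not confirmed in this thread, and the function names are made up for illustration.

```python
from pathlib import Path

# Candidate (usage, limit) file pairs: cgroup v2 first, then cgroup v1.
# Assumption: at least one pair is mounted and readable in the container.
CGROUP_FILES = [
    ("/sys/fs/cgroup/memory.current", "/sys/fs/cgroup/memory.max"),
    ("/sys/fs/cgroup/memory/memory.usage_in_bytes",
     "/sys/fs/cgroup/memory/memory.limit_in_bytes"),
]

def memory_fraction(usage_text, limit_text):
    """Return usage / limit as a float, or None when no limit is set."""
    limit_text = limit_text.strip()
    if limit_text == "max":  # cgroup v2 spelling of "no limit"
        return None
    limit = int(limit_text)
    if limit <= 0:
        return None
    return int(usage_text.strip()) / limit

def container_memory_fraction():
    """Read the first available cgroup file pair; None if nothing is found."""
    for usage_path, limit_path in CGROUP_FILES:
        usage, limit = Path(usage_path), Path(limit_path)
        if usage.exists() and limit.exists():
            return memory_fraction(usage.read_text(), limit.read_text())
    return None

if __name__ == "__main__":
    fraction = container_memory_fraction()
    if fraction is None:
        print("No memory limit found for this container")
    else:
        print(f"Using {fraction:.0%} of the container memory limit")
```

Running this periodically in a terminal while a simulation ramps up gives an early warning before the container reaches its limit.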
Yes, I've been trying to understand what happened yesterday. I think I was running a very memory-intensive task on several nodes, so I basically blew the memory of the server. I've since kept memory usage under control and I'm using the notebook sleep trick to keep the kernel alive, and so far it is working better. I still get the server shut down sometimes, despite these measures, but it is far less frequent. |
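For readers who haven't seen #105, a keep-alive notebook cell of the kind mentioned here might look roughly like the sketch below. This is an assumption for illustration, not the exact workaround from that issue: the function name and intervals are made up, and as reported earlier in this thread, this kind of trick did not reliably prevent culling.

```python
import time

def keep_alive(total_seconds, interval_seconds=60, sleep=time.sleep):
    """Print a heartbeat every interval_seconds until total_seconds have passed.

    Returns the number of heartbeats printed. The sleep argument is
    injectable only so the loop can be exercised without waiting.
    """
    ticks = 0
    elapsed = 0
    while elapsed < total_seconds:
        sleep(interval_seconds)
        elapsed += interval_seconds
        ticks += 1
        # Producing output keeps the kernel visibly busy in the notebook.
        print(f"still alive after {elapsed} s", flush=True)
    return ticks
```

Calling `keep_alive(8 * 3600)` in a notebook cell would keep that kernel busy for roughly eight hours.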
Over the last few days I've been having issues with my connection to the JupyterHub servers. Today my server was stopped twice while I was running simulations. All of a sudden my connection dropped and I had to start a server again. This is particularly annoying when working with Julia, since one needs to download and re-compile all the packages.
Is there a reason behind this unstable behaviour? Thanks again for your help!