idle nodes on gcs cluster #769
Seems like Dask Gateway and some JupyterHub pods are occupying these nodes.
The dask-gateway pods should likely be in the core node pool, along with JupyterHub. There's no reason to keep them separate I think. I can take care of that. I'm not sure about the continuous image-puller. I gather that it's a JupyterHub thing. I'm not sure what the impact of disabling it would be though. It seems to me like it shouldn't be the sole thing keeping a node from scaling down (and maybe if we fix the dask-gateway pods, it would scale down).
Thanks for looking into this Tom! Do we need some sort of cron job that checks whether these services are running on non-core nodes?
With dask/dask-gateway#325 and dask/dask-gateway#324 we'll be able to set things up so that these pods don't run on non-core nodes in the first place. That'll need to wait for the next dask-gateway release. In the meantime, we can patch around it:

```
# file: patch.yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: hub.jupyter.org/node-purpose
                    operator: In
                    values:
                      - core
```

```
$ kubectl -n staging patch deployment traefik-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/traefik-gcp-uscentral1b-staging-dask-gateway patched
$ kubectl -n staging patch deployment api-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/api-gcp-uscentral1b-staging-dask-gateway patched
$ kubectl -n staging patch deployment controller-gcp-uscentral1b-staging-dask-gateway --patch="$(cat patch.yaml)"
deployment.apps/controller-gcp-uscentral1b-staging-dask-gateway patched
```

I've confirmed that those were moved to the default pool for staging at least, and things seem to still work. Still to do are
I'll get to those later.
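For the record, a quick way to confirm where those pods landed after the patch, and which nodes are labelled as core (plain kubectl, nothing specific to this deployment):

```
$ kubectl -n staging get pods -o wide | grep dask-gateway
$ kubectl get nodes -L hub.jupyter.org/node-purpose
```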
I might have broken some prometheus / grafana things (the hub should be fine)
I need to figure out what pods are actually needed per namespace for prometheus-operator to function.
@consideRatio the GCP cluster has a node with just system pods and two `continuous-image-puller` pods:

```
$ kubectl get pod -o wide --all-namespaces | grep gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd
kube-system   fluentd-gke-kv5n6                                                2/2   Running   0   49d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   gke-metadata-server-p8wpm                                        1/1   Running   0   49d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   gke-metrics-agent-px4vg                                          1/1   Running   0   49d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   kube-dns-7c976ddbdb-kqglx                                        4/4   Running   2   49d   10.37.170.162   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   kube-proxy-gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   1/1   Running   0   68d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   netd-nbvfh                                                       1/1   Running   0   69d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
kube-system   prometheus-to-sd-9fsqg                                           1/1   Running   0   69d   10.128.0.108    gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
prod          continuous-image-puller-52hgv                                    1/1   Running   0   10h   10.37.170.239   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
staging       continuous-image-puller-7fvmg                                    1/1   Running   0   10h   10.37.170.229   gke-pangeo-uscentral-nap-n1-highmem-4-04fd1efc-11hd   <none>   <none>
```

That node is in an auto-provisioned node pool set to auto-scale down all the way to zero. I wouldn't expect the `continuous-image-puller` pods to be what's keeping it from scaling down.
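To double-check that pool's autoscaling configuration, something along these lines should work; the pool, cluster, and zone names here are only inferred from the node name above, so treat them as guesses:

```
# Look at the autoscaling block in the output
$ gcloud container node-pools describe nap-n1-highmem-4 \
    --cluster pangeo-uscentral --zone us-central1-b \
    --format=yaml
```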
https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/optimization.html suggests that the continuous-image-puller isn't all that useful on its own, and we aren't using user placeholders.
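If we do decide to turn it off, the z2jh chart exposes a setting for this; a minimal sketch of the relevant helm values, assuming a reasonably recent chart version:

```
# jupyterhub (z2jh) helm chart values
prePuller:
  continuous:
    enabled: false
```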
Hmmm, I guess if you only pull a single image and don't have user placeholders, then it's just a pod requesting no resources, and it can be evicted by other pods if needed. It is very harmless in the latest z2jh release, and it won't block scale down. I would inspect all pods on the nodes individually with `kubectl describe nodes` to see what pods ran on them, and I would check what the cluster-autoscaler status configmap in the kube-system namespace was saying.
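Concretely, that inspection might look something like the commands below. The `cluster-autoscaler-status` configmap name is the one written by the standard cluster-autoscaler; a fully managed autoscaler (as on GKE) may not expose it, so treat that part as an assumption:

```
# Per-node details, including the pods scheduled on each node
$ kubectl describe nodes

# The autoscaler's own view of scale-down candidates (if the configmap exists)
$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
```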
Thanks, `kubectl describe nodes` is helpful.
Edit: Now that I've disabled the continuous image puller, these unused nodes have gained the taints:
```
Taints: ToBeDeletedByClusterAutoscaler=1602078148:NoSchedule
DeletionCandidateOfClusterAutoscaler=1602077543:PreferNoSchedule
```
And now it's been autoscaled down. So I think this is the behavior we want.
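A quick way to spot nodes the autoscaler has marked like this across the whole cluster (plain kubectl, nothing project-specific):

```
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```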
A few more stray pods that I'll pin to the core pool.
This removes a few more pieces of the prometheus-operator-based metrics setup, which we replaced with separate prometheus and grafana charts. Its dependency on nginx-ingress caused the unused stray pods in pangeo-data#769 (comment).
Leaving a note here for future debugging. I noticed that the node
I see a
So there's a system pod that was added to the high-memory pool. Ideally those would be in the core pool. I'll see if I can add an annotation to it.
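One possibility (an assumption on my part, not something settled in this thread) is the autoscaler's `safe-to-evict` annotation, which tells the cluster autoscaler it may evict the pod when draining a node. It should go on the pod template of whatever owns the pod rather than on the pod itself, or it will be lost when the pod is recreated; `some-system-deployment` below is a placeholder:

```
$ kubectl -n kube-system patch deployment some-system-deployment --type=merge \
    -p '{"spec": {"template": {"metadata": {"annotations": {"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"}}}}}'
```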
Hmm, according to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-to-set-pdbs-to-enable-ca-to-move-kube-system-pods
We're probably OK with that. I wonder if defining a PDB is better than (somehow?) setting the nodeAffinity so that it ends up in the core pool in the first place? We would want the affinity regardless so that it doesn't bounce between non-core nodes.
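For reference, the FAQ's approach is a PodDisruptionBudget along these lines; the name and labels below are hypothetical and would need to match the actual pod (older clusters use `policy/v1beta1` instead of `policy/v1`):

```
# file: system-pod-pdb.yaml (hypothetical example)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-system-pdb
  namespace: kube-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      k8s-app: example-system-pod   # replace with the real pod's labels
```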
I randomly logged into the google cloud console to monitor our cluster tonight. I found that the cluster was scaled up to 8 nodes / 34 vCPUs / 170 GB memory.
However, afaict there are only two jupyter users logged in:
I poked around the nodepools, and the nodes seemed to be heavily undersubscribed.
This is as far as my debugging skills go. I don't know how to figure out what pods are running on those nodes. I wish the elastic nodepools would scale down. Maybe there are some permanent services whose pods got stuck on those nodes and now they can't be scaled down?
This is important because it costs a lot of money to have these VMs constantly running.
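(For what it's worth, a generic way to answer "what pods are running on a given node", assuming kubectl access to the cluster; `<node-name>` is a placeholder:)

```
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
```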