Add Grafana Dashboard for KubeRay Operator #3676
Signed-off-by: win5923 <[email protected]>
NAME                      DESIRED WORKERS   AVAILABLE WORKERS   CPUS    MEMORY   GPUS   STATUS   AGE
raycluster-kuberay        1                 1                   2       3G       0      ready    8m3s
rayjob-sample-m9bd7       1                 1                   400m    0        0      ready    5m33s
rayservice-sample-vs9mm   1                 1                   2500m   4Gi      0      ready    5d19h
I found that if a RayCluster finished provisioning before the observability stack was set up, that RayCluster won't appear in the "provision second" panel.
However, it is still counted in the Namespace Resource Statistics panel, which makes the dashboard inconsistent.
In my case, the user can't get any information about rayservice-sample-vs9mm.
This might not be a common scenario, though.
Besides that, nice work!
…operator/grafana
Signed-off-by: win5923 <[email protected]>
@kevin85421 I think we can merge this first and iterate on it. Thank you!
Why are these changes needed?
https://docs.google.com/document/d/1zNiE7lVZYjhrxlTbh1UXOVpR6hh1GIeSfCfE9Lt5v6Y/edit?tab=t.0#heading=h.6wldsmnpunmt
Add a Grafana Dashboard for the KubeRay Operator. It enables SRE teams and platform engineers to monitor the performance of `Controller-Runtime` and the status of `RayCluster`, `RayService`, and `RayJob` CRs running on Kubernetes in real time. It relies on KubeRay v1.4.0 or later and kube-state-metrics.
Since KubeRay 1.4.0 has not been officially released yet, you can follow the instructions in DEVELOPMENT.md to build and deploy the latest version of the KubeRay Operator to your cluster.
To set up the required monitoring stack, follow the official guide:
🔗 https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/prometheus-grafana.html
To enable metric scraping from the KubeRay Operator, you can modify the ServiceMonitor and apply it to your cluster.
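As a rough illustration of what such a ServiceMonitor might look like, here is a minimal Python sketch that generates one as JSON (which `kubectl apply` also accepts). The resource name, namespaces, label selector, port name, and the `release: prometheus` label are all assumptions chosen for illustration, not values taken from this PR; check them against the ServiceMonitor in the repository and the labels on your operator Service before applying anything.

```python
import json


def build_service_monitor(
    name="kuberay-operator-monitor",      # hypothetical resource name
    namespace="prometheus-system",        # assumed namespace of the monitoring stack
    operator_namespace="default",         # assumed namespace of the KubeRay operator
    match_labels=None,
    port="http",                          # assumed name of the metrics port on the Service
    interval="30s",
):
    """Return a ServiceMonitor manifest (as a dict) for the operator Service."""
    if match_labels is None:
        # Assumed label on the operator Service; verify with `kubectl get svc --show-labels`.
        match_labels = {"app.kubernetes.io/name": "kuberay-operator"}
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumed label that the Prometheus instance uses to discover ServiceMonitors.
            "labels": {"release": "prometheus"},
        },
        "spec": {
            # Look for the Service in the operator's namespace.
            "namespaceSelector": {"matchNames": [operator_namespace]},
            # Select the Service by its labels.
            "selector": {"matchLabels": dict(match_labels)},
            # Scrape the metrics endpoint on the named port.
            "endpoints": [{"port": port, "path": "/metrics", "interval": interval}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_service_monitor(), indent=2))
```

If the script is saved as, say, `make_service_monitor.py` (a hypothetical filename), the output can be piped into `kubectl apply -f -`.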
Feedback and suggestions are welcome!
Controller runtime:
RayCluster:
RayService:
RayJob:
Related issue number
#3171
Checks