Add Grafana Dashboard for KubeRay Operator #3676

Merged
merged 3 commits into from
May 29, 2025

Contributor

@win5923 win5923 commented May 24, 2025

Why are these changes needed?

https://docs.google.com/document/d/1zNiE7lVZYjhrxlTbh1UXOVpR6hh1GIeSfCfE9Lt5v6Y/edit?tab=t.0#heading=h.6wldsmnpunmt

Add a Grafana dashboard for the KubeRay Operator. It lets SRE teams and platform engineers monitor controller-runtime performance and the status of RayCluster, RayService, and RayJob CRs running on Kubernetes in real time.

It relies on KubeRay v1.4.0 or later and kube-state-metrics.

Since KubeRay 1.4.0 has not been officially released yet, you can follow the instructions in DEVELOPMENT.md to build and deploy the latest version of the KubeRay Operator to your cluster.

Feedback and suggestions are welcome!

Controller runtime:

image

RayCluster:

image

RayService:

image

RayJob:

image

Related issue number

#3171

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@win5923 win5923 force-pushed the operator/grafana branch 2 times, most recently from 2a7c4a1 to db1751e Compare May 24, 2025 13:15
@win5923 win5923 force-pushed the operator/grafana branch from db1751e to 313dfa4 Compare May 24, 2025 13:15
@win5923 win5923 marked this pull request as ready for review May 24, 2025 13:25
@win5923 win5923 marked this pull request as draft May 24, 2025 13:29
Contributor

@owenowenisme owenowenisme left a comment

Small problem:
image

NAME                      DESIRED WORKERS   AVAILABLE WORKERS   CPUS    MEMORY   GPUS   STATUS   AGE
raycluster-kuberay        1                 1                   2       3G       0      ready    8m3s
rayjob-sample-m9bd7       1                 1                   400m    0        0      ready    5m33s
rayservice-sample-vs9mm   1                 1                   2500m   4Gi      0      ready    5d19h

I found that if a RayCluster finishes provisioning before we set up the observability stack, that RayCluster won't appear in the provision second panel.

However, it is still counted in the Namespace Resource Statistics panel, which makes the dashboard inconsistent.
In my case, the user can't get any info on rayservice-sample-vs9mm.
This might not be a common scenario, though.

@owenowenisme
Contributor

owenowenisme commented May 25, 2025

Besides that, nice work!

@win5923
Contributor Author

win5923 commented May 25, 2025

I found that if a RayCluster finishes provisioning before we set up the observability stack, that RayCluster won't appear in the provision second panel.

Sorry, I couldn't reproduce this. I think it's because rayservice-sample-vs9mm doesn't have the kuberay_cluster_provisioned_duration_seconds metric. Since the panel uses an inner join, clusters without this metric won't appear.
image
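
To illustrate why an inner join can hide a cluster that a plain count still includes, here is a minimal sketch in Python. It is not the dashboard's actual query: the `default` namespace, the `cluster_present` series, and every sample value are hypothetical, and only the metric name kuberay_cluster_provisioned_duration_seconds comes from this thread.

```python
# Hypothetical Prometheus-style samples keyed by (namespace, name).
# The "default" namespace and all values are made up for illustration.
cluster_present = {
    ("default", "raycluster-kuberay"): 1,
    ("default", "rayjob-sample-m9bd7"): 1,
    ("default", "rayservice-sample-vs9mm"): 1,
}

# Stand-in for kuberay_cluster_provisioned_duration_seconds. We assume one
# cluster never reported this metric, as suggested in the comment above.
provisioned_duration_seconds = {
    ("default", "raycluster-kuberay"): 12.0,
    ("default", "rayjob-sample-m9bd7"): 9.5,
}


def inner_join(left, right):
    """Keep only the keys present in both series, like an inner join on labels."""
    return {key: (left[key], right[key]) for key in left if key in right}


# A count-style panel only needs the first series, so it sees every cluster.
total_clusters = len(cluster_present)

# A panel built on the inner join silently drops clusters without the metric.
joined = inner_join(cluster_present, provisioned_duration_seconds)
missing = sorted(key[1] for key in cluster_present if key not in joined)

print("clusters counted:", total_clusters)
print("clusters in joined panel:", len(joined))
print("dropped by the join:", missing)
```

If hiding such clusters is not desired, an outer join with an empty value for the missing duration would keep the two panels consistent; whether that fits this dashboard is a judgment call for a follow-up.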

@win5923 win5923 marked this pull request as ready for review May 27, 2025 06:07
@troychiu
Contributor

@kevin85421 I think we can merge this first and iterate on it. Thank you!

@kevin85421 kevin85421 merged commit dcb97ce into ray-project:master May 29, 2025
24 checks passed
@win5923 win5923 deleted the operator/grafana branch May 29, 2025 06:14
pawelpaszki pushed a commit to opendatahub-io/kuberay that referenced this pull request Jun 10, 2025
4 participants