Add default Ray node label info to Ray Pod environment #3699
base: master
Conversation
cc: @MengjinYan
@@ -493,6 +493,9 @@ func BuildPod(ctx context.Context, podTemplateSpec corev1.PodTemplateSpec, rayNo
	initLivenessAndReadinessProbe(&pod.Spec.Containers[utils.RayContainerIndex], rayNodeType, creatorCRDType)
}

// add downward API environment variables for Ray default node labels
addDefaultRayNodeLabels(&pod)
Should we guard this logic with a Ray version check?
This code doesn't rely on any API change in Ray; it just sets some env vars, and the actual node labels are set in Ray core using those vars. I can add a version guard here (for whatever release ray-project/ray#53360 is included in) if we don't want it setting unused vars for users on older versions of Ray.
From offline discussion with @MengjinYan, we were leaning towards not including a version guard, since users are not required to specify the Ray version they're using in the CR spec.
LGTM. @kevin85421 Can you take a look from KubeRay's perspective?
pod.Spec.Containers[utils.RayContainerIndex].Env,
// used to set the ray.io/market-type node label
corev1.EnvVar{
	Name: "RAY_NODE_MARKET_TYPE",
Is the plan for Ray Core to check these env vars? Can you link the PR that includes this change?
Yeah that's correct - this is the related PR: ray-project/ray#53360
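For reference, a minimal sketch (not the literal diff) of how the market-type env var might be appended to the Ray container. `getRayMarketTypeFromNodeSelector` and `utils.RayContainerIndex` come from the KubeRay code shown in this PR, while the wrapper function name and the "spot"/"on-demand" values are illustrative assumptions:

```go
import corev1 "k8s.io/api/core/v1"

// addMarketTypeEnvVar is a hypothetical wrapper showing how the env var could
// be appended; the actual PR does this inside addDefaultRayNodeLabels.
func addMarketTypeEnvVar(pod *corev1.Pod) {
	// Assumed values: "spot" or "on-demand", derived from the nodeSelector.
	marketType := getRayMarketTypeFromNodeSelector(pod)
	pod.Spec.Containers[utils.RayContainerIndex].Env = append(
		pod.Spec.Containers[utils.RayContainerIndex].Env,
		// used to set the ray.io/market-type node label
		corev1.EnvVar{
			Name:  "RAY_NODE_MARKET_TYPE",
			Value: marketType,
		},
	)
}
```

Ray core (ray-project/ray#53360) would then read this env var at startup to set the ray.io/market-type node label.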
@kevin85421 Bumping this to see if we can include it in 1.4.
@ryanaoleary I chatted with @MengjinYan, and my understanding is that this doesn’t need to be included in v1.4.0. Could you sync with @MengjinYan and let me know if I’m mistaken? Thanks!
Synced offline with @MengjinYan and yeah, there's no urgency to include this in v1.4.0; we can wait for the next release. My thought was just that it'd be useful to have this functionality in the soonest stable release for testing, but I can just use the nightly image.
This makes sense. I’ll review it for now. We’ll make a best-effort attempt to include this PR, but there’s no guarantee.
)
}
// getRayMarketTypeFromNodeSelector is a helper function to determine the ray.io/market-type label |
Is `nodeSelector` enough? There are multiple methods to affect Pod scheduling, such as `nodeSelector`, `taints`/`tolerations`, `affinity`, `priorityClass`, and so on.
I think for market type, `nodeSelector` is probably enough for now since it seems to be the standard method to schedule on spot instances: https://cloud.google.com/kubernetes-engine/docs/concepts/spot-vms. Of the others, I think we'd only be concerned about `taints`/`tolerations` and `affinity`. For `nodeAffinity`, we could also check for something like:

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: cloud.google.com/gke-spot
            operator: In
            values:
            - "true"
```

and add the label if it exists. For `taints`/`tolerations` I don't think we need to check for them, even if the Pod contains a toleration like:

```yaml
spec:
  tolerations:
  - key: "cloud.google.com/gke-spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```

since it's not guaranteed that the Pod will schedule to a spot instance unless an `affinity` or `nodeSelector` also exists (plus it's not guaranteed the node pool will have the matching taint, so the toleration could just be ignored).
For now I'll go ahead and add a check for `nodeAffinity`. Does the above make sense to you, @MengjinYan?
Added this check in cc581c2
Having an affinity or toleration doesn't always guarantee that the Pod is running on a spot instance. I think the node selector is as close as we get; for other use cases the user will need to override labels using `rayStartParams`.
I guess node affinity with requiredDuringSchedulingIgnoredDuringExecution would actually, so nvm :)
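For illustration, a rough sketch of what the `nodeAffinity` check discussed above could look like; the helper name is hypothetical and this is not the actual contents of cc581c2:

```go
import corev1 "k8s.io/api/core/v1"

// podRequiresSpotNodeAffinity (hypothetical helper) reports whether the Pod has
// a required nodeAffinity term matching the GKE spot label, i.e. it can only be
// scheduled onto spot nodes.
func podRequiresSpotNodeAffinity(pod *corev1.Pod) bool {
	if pod.Spec.Affinity == nil || pod.Spec.Affinity.NodeAffinity == nil {
		return false
	}
	required := pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution
	if required == nil {
		return false
	}
	for _, term := range required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key != "cloud.google.com/gke-spot" || expr.Operator != corev1.NodeSelectorOpIn {
				continue
			}
			for _, v := range expr.Values {
				if v == "true" {
					return true
				}
			}
		}
	}
	return false
}
```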
// getRayMarketTypeFromNodeSelector is a helper function to determine the ray.io/market-type label
// based on user-provided Kubernetes nodeSelector values.
func getRayMarketTypeFromNodeSelector(pod *corev1.Pod) string {
I guess some schedulers or webhooks may update the node selector after the Pod is created. We should take it into consideration.
Do you mean that the value of the default labels might change after the ray node started?
> might change after the ray node started?

No, I mean it may be changed after KubeRay constructs the Pod spec but before the Pod is created or scheduled.
Are there any webhooks that modify the value of `cloud.google.com/gke-spot` on a Pod? I'm having difficulty finding a reference. I think we should default to adding Ray node labels based on what the user specifies in their Pod spec.
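As a rough sketch (illustrative only; the exact selector keys and return values the PR checks aren't shown here), `getRayMarketTypeFromNodeSelector` could look something like:

```go
import corev1 "k8s.io/api/core/v1"

// getRayMarketTypeFromNodeSelector maps well-known cloud provider spot/preemptible
// node selectors to a market-type string for the RAY_NODE_MARKET_TYPE env var.
// The selector keys and the "spot"/"on-demand" values below are assumptions.
func getRayMarketTypeFromNodeSelector(pod *corev1.Pod) string {
	spotSelectors := map[string]string{
		"cloud.google.com/gke-spot":             "true",
		"cloud.google.com/gke-preemptible":      "true",
		"eks.amazonaws.com/capacityType":        "SPOT",
		"kubernetes.azure.com/scalesetpriority": "spot",
	}
	for key, want := range spotSelectors {
		if got, ok := pod.Spec.NodeSelector[key]; ok && got == want {
			return "spot"
		}
	}
	return "on-demand"
}
```

Defaulting to on-demand when no spot selector matches mirrors the discussion above: the node selector is the closest signal available, and other cases can be overridden via `rayStartParams`.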
Name: utils.RayNodeZone,
ValueFrom: &corev1.EnvVarSource{
	FieldRef: &corev1.ObjectFieldSelector{
		FieldPath: fmt.Sprintf("metadata.labels['%s']", utils.K8sTopologyZoneLabel),
I wonder if we should check the existence of the label in the Pod first before referencing it from the downward API. Should we also check the node selector for `topology.kubernetes.io/zone=<zone>` and use that value for RAY_NODE_ZONE?
That makes sense to me. I added logic to check the Pod labels, then NodeSelectors, and then fall back to using the downward API in 604c162 for both the region and zone.
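A sketch of that lookup order (illustrative, not the exact code from 604c162; the PR uses constants such as `utils.RayNodeZone` and `utils.K8sTopologyZoneLabel` rather than the literals below):

```go
import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// buildZoneEnvVar (hypothetical helper) prefers an explicit Pod label, then a
// nodeSelector value, and only falls back to the downward API if neither is set.
func buildZoneEnvVar(pod *corev1.Pod) corev1.EnvVar {
	const zoneLabel = "topology.kubernetes.io/zone"

	if zone, ok := pod.Labels[zoneLabel]; ok && zone != "" {
		return corev1.EnvVar{Name: "RAY_NODE_ZONE", Value: zone}
	}
	if zone, ok := pod.Spec.NodeSelector[zoneLabel]; ok && zone != "" {
		return corev1.EnvVar{Name: "RAY_NODE_ZONE", Value: zone}
	}
	// Downward API fallback: resolved at runtime from the Pod's own labels.
	return corev1.EnvVar{
		Name: "RAY_NODE_ZONE",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: fmt.Sprintf("metadata.labels['%s']", zoneLabel),
			},
		},
	}
}
```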
Add default Ray node label info to Ray Pod environment
Why are these changes needed?
This PR adds market-type information for different cloud providers to the Ray Pod environment based on the provided `nodeSelector` value. This PR also adds environment variables to pass region and zone information using the downward API (kubernetes/kubernetes#127092). These environment variables will be used in Ray core to set default Ray node labels.

I'll add a comment below with my manual test results from propagating `topology.k8s.io/region` and `topology.k8s.io/zone` on a GKE v1.33 alpha cluster.

Related issue number
ray-project/ray#51564
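For context, an illustrative view of the env vars this PR adds to the Ray container (the RAY_NODE_REGION name is assumed by analogy with RAY_NODE_ZONE; region and zone resolve at runtime via the downward API, while market type is a plain value derived from the nodeSelector):

```go
import corev1 "k8s.io/api/core/v1"

// Illustrative end result, not the literal diff.
var defaultRayNodeLabelEnvVars = []corev1.EnvVar{
	{Name: "RAY_NODE_MARKET_TYPE", Value: "spot"}, // derived from the nodeSelector
	{
		Name: "RAY_NODE_REGION", // assumed name, by analogy with RAY_NODE_ZONE
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "metadata.labels['topology.kubernetes.io/region']",
			},
		},
	},
	{
		Name: "RAY_NODE_ZONE",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "metadata.labels['topology.kubernetes.io/zone']",
			},
		},
	},
}
```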