mtian29
Contributor

@mtian29 mtian29 commented Oct 1, 2025

Why are these changes needed?

Support for Volcano Network Topology Aware Scheduling

Closes issue #3641.

Test

Build an image and use it for the KubeRay operator.

Apply a RayCluster with the following labels:

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: kwok-raycluster-h100-q21-low
  labels:
    ray.io/scheduler-name: volcano 
    volcano.sh/queue-name: queue2 
    ray.io/priority-class-name: ml-tier1 
    volcano.sh/network-topology-mode: hard # <----
    volcano.sh/network-topology-highest-tier-allowed: "1" # <-----
.....

The Volcano PodGroup has the networkTopology field injected:

❯ kg pg -oyaml
apiVersion: v1
items:
- apiVersion: scheduling.volcano.sh/v1beta1
  kind: PodGroup
  metadata:
    annotations:
      volcano.sh/job-allocated-hypernode: hypernode-ad1
    creationTimestamp: "2025-09-30T22:40:12Z"
    generation: 3
    name: ray-kwok-raycluster-h100-q21-low-pg
    namespace: kuberay
.....
  spec:
    minMember: 10
    minResources:
      cpu: "80"
      memory: 80Gi
      nvidia.com/h100: "80"
    networkTopology:
      highestTierAllowed: 1 # <----
      mode: hard # <----

Other label combinations and their results:

volcano.sh/network-topology-mode: soft
volcano.sh/network-topology-highest-tier-allowed: "2"

networkTopology:
  highestTierAllowed: 2
  mode: soft


volcano.sh/network-topology-mode: hard
volcano.sh/network-topology-highest-tier-allowed: "2"

networkTopology:
  highestTierAllowed: 2
  mode: hard


volcano.sh/network-topology-mode: hard

networkTopology:
  highestTierAllowed: 1 # <-- default
  mode: hard

----
volcano.sh/network-topology-mode: soft

networkTopology:
  highestTierAllowed: 1 # <-- default
  mode: soft

----

no label => no network topology

spec:
  minMember: 20
  minResources:
    cpu: "160"
    memory: 160Gi
    nvidia.com/h100: "160"
  priorityClassName: ml-tier1
  queue: queue2

----

volcano.sh/network-topology-mode: soft
volcano.sh/network-topology-highest-tier-allowed: "abc"

"PodGroup.Error":"failed to convert volcano.sh/network-topology-highest-tier-allowed label to int: strconv.Atoi: parsing "abc": invalid syntax for podgroup ray-kwok-raycluster-h100-q2-soft-topology2-pg in namespace kuberay

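The label-to-spec mapping exercised by the manual tests above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `networkTopology` is a hypothetical stand-in for Volcano's NetworkTopologySpec, `topologyFromLabels` is an invented helper name, and the sketch sets the default tier of 1 itself even though the description does not say whether KubeRay or Volcano fills it in. Ignoring a tier label that arrives without a mode label is likewise an assumption.

```go
package main

import (
	"fmt"
	"strconv"
)

// Label keys as they appear in the PR description.
const (
	modeLabel = "volcano.sh/network-topology-mode"
	tierLabel = "volcano.sh/network-topology-highest-tier-allowed"
)

// networkTopology is a hypothetical stand-in for Volcano's NetworkTopologySpec.
type networkTopology struct {
	Mode               string
	HighestTierAllowed int
}

// topologyFromLabels returns nil when the mode label is absent, uses tier 1
// when only the mode is set (assumed default), and returns an error when the
// tier label is not an integer.
func topologyFromLabels(labels map[string]string) (*networkTopology, error) {
	mode, ok := labels[modeLabel]
	if !ok {
		// No mode label: no network topology is requested.
		return nil, nil
	}
	spec := &networkTopology{Mode: mode, HighestTierAllowed: 1}
	if tier, ok := labels[tierLabel]; ok {
		n, err := strconv.Atoi(tier)
		if err != nil {
			return nil, fmt.Errorf("failed to convert %s label to int: %w", tierLabel, err)
		}
		spec.HighestTierAllowed = n
	}
	return spec, nil
}

func main() {
	spec, err := topologyFromLabels(map[string]string{modeLabel: "soft"})
	fmt.Println(spec.Mode, spec.HighestTierAllowed, err)

	_, err = topologyFromLabels(map[string]string{modeLabel: "soft", tierLabel: "abc"})
	fmt.Println(err)
}
```
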
Related issue number

Closes #3641

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@mtian29 mtian29 changed the title Support for Volcano Network Topology Aware Scheduling [Feature] Support for Volcano Network Topology Aware Scheduling Oct 1, 2025
@mtian29 mtian29 changed the title [Feature] Support for Volcano Network Topology Aware Scheduling [Feature] Support for Volcano Network Topology Aware Scheduling for kuberay Oct 1, 2025
mode, modeOk := app.ObjectMeta.Labels[NetworkTopologyModeLabelKey]
highestTier, tierOk := app.ObjectMeta.Labels[NetworkTopologyHighestTierAllowedLabelKey]
if modeOk && tierOk {
	highestTierInt, err := strconv.Atoi(highestTier)
Collaborator

Should we handle this error so that the user will get a better understanding of what happened?

Contributor Author

We should. I've updated the PR. Thanks for reviewing.
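One way to surface the conversion failure with useful context, sketched under assumptions: `parseTier` and its parameters are hypothetical names, and the message only loosely follows the error shown in the PR description. Wrapping with `%w` keeps the underlying `strconv` error inspectable by callers.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parseTier is a hypothetical helper that converts the highest-tier-allowed
// label value to an int and adds the PodGroup name and namespace to any error.
func parseTier(value, podGroup, namespace string) (int, error) {
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf(
			"failed to convert highest-tier-allowed label to int for podgroup %s in namespace %s: %w",
			podGroup, namespace, err)
	}
	return n, nil
}

func main() {
	_, err := parseTier("abc", "example-pg", "kuberay")
	fmt.Println(err)
	// The wrapped error still matches strconv's sentinel for malformed input.
	fmt.Println(errors.Is(err, strconv.ErrSyntax))
}
```
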


mode, modeOk := app.ObjectMeta.Labels[NetworkTopologyModeLabelKey]
highestTier, tierOk := app.ObjectMeta.Labels[NetworkTopologyHighestTierAllowedLabelKey]
if modeOk && tierOk {
Collaborator

According to the Network Topology Aware Scheduling policy, highestTierAllowed is not required when mode is soft. With this check, if highestTierAllowed is not set in soft mode, the NetworkTopologySpec is not set at all.

Contributor Author

This is a good catch.
I've updated the PR.
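The difference the reviewer points out can be illustrated by comparing two conditions on the same labels. This is only a sketch: `requiresBoth` mirrors the `modeOk && tierOk` check quoted above, while `requiresModeOnly` is an assumed relaxed condition and not necessarily what the updated PR does.

```go
package main

import "fmt"

const (
	modeLabel = "volcano.sh/network-topology-mode"
	tierLabel = "volcano.sh/network-topology-highest-tier-allowed"
)

// requiresBoth mirrors the condition in the reviewed snippet: the spec is
// only populated when both labels are present.
func requiresBoth(labels map[string]string) bool {
	_, modeOk := labels[modeLabel]
	_, tierOk := labels[tierLabel]
	return modeOk && tierOk
}

// requiresModeOnly is an assumed relaxed condition: the mode label alone is
// enough to populate the spec.
func requiresModeOnly(labels map[string]string) bool {
	_, modeOk := labels[modeLabel]
	return modeOk
}

func main() {
	softOnly := map[string]string{modeLabel: "soft"}
	// With only the soft mode label, the first check skips the spec entirely.
	fmt.Println(requiresBoth(softOnly), requiresModeOnly(softOnly))
}
```
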

Collaborator

@win5923 win5923 left a comment

I think we should synchronize the PodGroup's NetworkTopology field when an existing RayCluster's labels are updated. WDYT?

Contributor Author

mtian29 commented Oct 1, 2025

> I think we should synchronize the PodGroup's NetworkTopology field when an existing RayCluster's labels are updated. WDYT?

Thanks, @win5923. Good point.

  • Where should we do this on update? I don't see a place to do it.
  • But even if we could, we don't need to. Network topology only affects scheduling: once a pod is bound to a node and initializing, changing the labels, spec, or network topology won't move it to a different node.

@mtian29 mtian29 changed the title [Feature] Support for Volcano Network Topology Aware Scheduling for kuberay [Feature] Support Volcano Network Topology Aware Scheduling for kuberay Oct 1, 2025
Collaborator

win5923 commented Oct 2, 2025

> Thanks, @win5923. Good point.
>
> Where should we do this on update? I don't see a place to do it.
> But even if we could, we don't need to. Network topology only affects scheduling: once a pod is bound to a node and initializing, changing the labels, spec, or network topology won't move it to a different node.

We can do this in syncPodGroup, but I think you’re right.
This is only for scheduling. If users want to change the topology settings, they should recreate the RayCluster.

Collaborator

@win5923 win5923 left a comment

Thanks for the contribution, the changes look good to me.
But I think we should wait until #3972 is merged, as it includes interface changes that might impact this implementation.

Contributor Author

mtian29 commented Oct 3, 2025

> Thanks for the contribution, the changes look good to me. But I think we should wait until #3972 is merged, as it includes interface changes that might impact this implementation.

Thanks @win5923
The only impacted part should be app.ObjectMeta.Labels[xxxxx] => owner.GetLabels()[xxxx]

Do you know when the next release of the KubeRay operator is? Can my change be included?
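The label-access change mentioned in this comment (`app.ObjectMeta.Labels[...]` becoming `owner.GetLabels()[...]`) can be sketched with a small interface. Everything here is illustrative: `labeled` stands in for the part of an owner object that exposes `GetLabels()`, and `fakeCluster` and `topologyMode` are hypothetical names, not types or functions from KubeRay.

```go
package main

import "fmt"

// labeled is a stand-in for an owner object that exposes its labels.
type labeled interface {
	GetLabels() map[string]string
}

// fakeCluster is a hypothetical owner type used only for illustration.
type fakeCluster struct {
	labels map[string]string
}

func (c fakeCluster) GetLabels() map[string]string { return c.labels }

// topologyMode reads the mode label from any owner that exposes GetLabels,
// instead of reaching into a concrete type's ObjectMeta.
func topologyMode(owner labeled) (string, bool) {
	mode, ok := owner.GetLabels()["volcano.sh/network-topology-mode"]
	return mode, ok
}

func main() {
	c := fakeCluster{labels: map[string]string{"volcano.sh/network-topology-mode": "hard"}}
	mode, ok := topologyMode(c)
	fmt.Println(mode, ok)
}
```

Reading labels through an interface keeps the lookup independent of the concrete owner type, which is presumably why the rebase only needed this one substitution.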

Collaborator

win5923 commented Oct 3, 2025

> Do you know when the next release of the KubeRay operator is? Can my change be included?

Nov. 1, 2025 (Branch Cut: Oct. 10).
Ref: https://docs.google.com/document/d/1rdXniNitHCNTGfyvvMPdMkp1cDnQtjOmfqNiWuvrS9A/edit?tab=t.0#heading=h.ctb1p12e6p4u

Yes, I believe this PR can be included in v1.5, as the changes are relatively minor and should be safe to merge before the cut-off.

Collaborator

troychiu commented Oct 3, 2025

Sorry for the late reply. I'll review this PR asap.

Collaborator

@troychiu troychiu left a comment

Hi, since #3972 has been merged, do you mind resolving the conflicts? Thank you!

@mtian29 mtian29 force-pushed the mtian/volcano-topology branch from 8599979 to f9cb394 Compare October 10, 2025 16:49
Contributor Author

mtian29 commented Oct 10, 2025

> Hi, since #3972 has been merged, do you mind resolving the conflicts? Thank you!

Thanks, @troychiu. Rebased.

Collaborator

@troychiu troychiu left a comment

Thank you!

Contributor Author

mtian29 commented Oct 11, 2025

> Thank you!

@troychiu Thanks. I changed the tests following your suggestions and also fixed the lint errors.

@troychiu
Collaborator

cc @Future-Outlier @rueian

@rueian rueian merged commit f22a75a into ray-project:master Oct 11, 2025
27 checks passed

6 participants