[RayCluster] Prototype multi-host indexing #3998
8a74046
to
a6b94b3
Compare
@ryanaoleary PTAL when you get the chance.
It'd be good to make clear the value of this PR. Currently, host and replica indexing for multi-host workers occurs in a separate GKE webhook that injects these values as env vars and a k8s label.

This PR moves the logic for indexing KubeRay worker Pods that request TPU from the webhook to KubeRay itself. By assigning indices as k8s Pod labels directly from KubeRay when the Pods are created, we avoid the need for complicated logic in the TPU webhook that tracks the state of multi-host replicas in a RayCluster using a PodInformer. Since these variables are already used in Ray core and in libraries like Train to handle the multi-host case, it makes sense to consolidate the logic in KubeRay. Additionally, since KubeRay is aware of when Pods are deleted, it becomes easier to scale down multi-host replicas atomically.

Overall, this PR consolidates logic that is currently spread across the TPU webhook, KubeRay, and Ray core. The next step after this PR would be to move the environment variable injection that occurs in the TPU webhook to Ray core, when the Raylet is started on a node. The worker lifecycle would then look as follows for multi-host workers:
a6b94b3
to
6935b9e
Compare
}

// Check if RayTpuMulithostIndexing feature is enabled
// Currently multihostIndexing won't work with Autoscaler v2 since autoscaler delete doesn't follow replica groups
// Currently multihostIndexing won't work with Autoscaler v2 since autoscaler delete doesn't follow replica groups
Can you explain more why this wouldn't work with the v2 autoscaler? Since it currently scales by replicas my initial thinking was that there shouldn't be an incompatibility.
I see. If that's the case, then it was a misunderstanding on my side: based on our prior discussion, my interpretation was that there was an incompatibility due to how it scaled the replicas. Will remove the autoscaling v2 check.
I could just be forgetting what the incompatibility is. If I'm remembering correctly, the v2 autoscaler determines the number of replicas of a group to scale, and then submits a scale request by patching both the replica count and workersToDelete of that group here.
There could be an issue with how the v2 autoscaler scales down here, since it doesn't consider whether to_delete_id is part of a multi-host group and will cause the entire group to scale down. I think this might be fine, though, since we consider NumOfHosts in the desired num workers of a type here.
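The two points in this comment can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the autoscaler's or the operator's code: `workerPod`, `desiredWorkerPods`, and `expandToReplicas` are hypothetical names, and the replica index is assumed to already be known for each Pod.

```go
package main

import (
	"fmt"
	"sort"
)

// workerPod is a hypothetical, simplified view of a worker Pod:
// its name and the multi-host replica it belongs to.
type workerPod struct {
	Name         string
	ReplicaIndex int
}

// desiredWorkerPods counts Pods, not replicas: every replica of a
// multi-host group spans numOfHosts Pods.
func desiredWorkerPods(replicas, numOfHosts int) int {
	return replicas * numOfHosts
}

// expandToReplicas returns the sorted names of every Pod that shares a
// replica with at least one Pod named in toDelete, so that a replica is
// removed as a whole rather than host by host.
func expandToReplicas(pods []workerPod, toDelete []string) []string {
	marked := make(map[string]bool, len(toDelete))
	for _, name := range toDelete {
		marked[name] = true
	}
	doomed := make(map[int]bool)
	for _, p := range pods {
		if marked[p.Name] {
			doomed[p.ReplicaIndex] = true
		}
	}
	var out []string
	for _, p := range pods {
		if doomed[p.ReplicaIndex] {
			out = append(out, p.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	pods := []workerPod{{"w0", 0}, {"w1", 0}, {"w2", 1}, {"w3", 1}}
	fmt.Println(desiredWorkerPods(2, 2))
	fmt.Println(expandToReplicas(pods, []string{"w2"}))
}
```

Whether the expansion belongs in the autoscaler or in the operator is exactly the open question in this thread; the sketch only shows that the expansion itself is a small computation once each Pod carries its replica index.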
We could add an e2e autoscaler test here https://github.com/ray-project/kuberay/tree/master/ray-operator/test/e2eautoscaler to verify the scale up/down behavior for both the V1 and V2 autoscaler in either this PR or a follow-up. That should probably be one of the requirements for moving this feature from alpha to beta.
45d53ae
to
bb602d5
Compare
Once this is passing CI, I think we can mark this as ready for review and ask other KubeRay contributors to review the feature.
5bebd86
to
1f21e83
Compare
1f21e83
to
7419b56
Compare
return errstd.Join(utils.ErrFailedCreateWorkerPod, err)

// Worker creation path for multi-host indexing
if multihostIndexingEnabled {
The logic below seems pretty complicated, should we abstract it away in a util package?
I don't think we should, since all it's really doing is going through and creating the remaining workers in groups. Abstracting it away in a util package would introduce another layer of indirection, which I thought might be unnecessary.
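For readers without the diff at hand, "creating the remaining workers in groups" can be pictured roughly as below. This is a hedged sketch rather than the code under review: `slot` and `missingSlots` are hypothetical names, and it assumes the controller already knows which (replica, host) positions are occupied.

```go
package main

import "fmt"

// slot identifies one host position inside one multi-host replica.
type slot struct {
	Replica int
	Host    int
}

// missingSlots lists the (replica, host) positions that have no Pod yet,
// ordered replica by replica so each group is filled before the next one.
func missingSlots(existing []slot, replicas, numOfHosts int) []slot {
	taken := make(map[slot]bool, len(existing))
	for _, s := range existing {
		taken[s] = true
	}
	var missing []slot
	for r := 0; r < replicas; r++ {
		for h := 0; h < numOfHosts; h++ {
			s := slot{Replica: r, Host: h}
			if !taken[s] {
				missing = append(missing, s)
			}
		}
	}
	return missing
}

func main() {
	existing := []slot{{0, 0}, {0, 1}, {1, 0}}
	for _, s := range missingSlots(existing, 2, 2) {
		fmt.Printf("create worker for replica %d, host %d\n", s.Replica, s.Host)
	}
}
```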
I removed the "[TPU]" from the title. We should ensure this implementation is generic enough to also be used for GPUs, e.g. for labelling worker pods in an NVLink using GB200s.
7419b56
to
daac294
Compare
Signed-off-by: Aaron Liang <[email protected]>
daac294
to
786db45
Compare
Why are these changes needed?
Part of #3902. Proof of concept: adds group indexing and host indexing to multi-host workers.
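As a rough illustration of what group and host indexing means, the sketch below derives both indices from a Pod's position within its worker group. The label keys are placeholders invented for this example (the keys used by the PR are not shown in this conversation), and `indexLabels` is a hypothetical helper, not a KubeRay function.

```go
package main

import (
	"fmt"
	"strconv"
)

// Placeholder label keys, assumed for illustration only.
const (
	replicaIndexLabel = "example.com/replica-index"
	hostIndexLabel    = "example.com/host-index"
)

// indexLabels returns the index labels for the Pod at podOrdinal
// (0-based) in a worker group whose replicas each span numOfHosts Pods.
func indexLabels(podOrdinal, numOfHosts int) (map[string]string, error) {
	if podOrdinal < 0 || numOfHosts < 1 {
		return nil, fmt.Errorf("invalid ordinal %d or numOfHosts %d", podOrdinal, numOfHosts)
	}
	return map[string]string{
		// Which multi-host replica (group) the Pod belongs to.
		replicaIndexLabel: strconv.Itoa(podOrdinal / numOfHosts),
		// The Pod's position among the hosts of that replica.
		hostIndexLabel: strconv.Itoa(podOrdinal % numOfHosts),
	}, nil
}

func main() {
	for i := 0; i < 4; i++ {
		labels, err := indexLabels(i, 2)
		if err != nil {
			panic(err)
		}
		fmt.Println(i, labels[replicaIndexLabel], labels[hostIndexLabel])
	}
}
```

Deriving the indices from a running ordinal only works for a fresh group; once Pods have been deleted, the controller would need to reuse the freed positions instead.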
Related issue number
For: #3902
Checks