Adding LWS Integration #1174
Conversation
axlearn/cloud/gcp/lws_utils.py (outdated)

```python
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
```
What's the retry policy of LWS? Could you help me understand what happens when:
- the leader fails or is preempted?
- a worker fails or is preempted?
The default behavior is that if any pod in the group fails, regardless of whether it is a leader or a worker, the whole group fails and is restarted. LWS also supports not restarting the whole group by setting `RestartPolicy: None`.
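A minimal sketch of how this could surface in the spec dict the builder emits — the `leaderWorkerTemplate.size`/`restartPolicy` field names follow the upstream LeaderWorkerSet v1 API, while the helper name is illustrative:

```python
def build_lws_spec(*, group_size: int, restart_policy: str = "RecreateGroupOnPodRestart") -> dict:
    """Builds a LeaderWorkerSet spec fragment (illustrative helper).

    restart_policy:
      "RecreateGroupOnPodRestart" - any pod failure restarts the whole group (default).
      "None" - failed pods are restarted individually; the group is not recreated.
    """
    return dict(
        leaderWorkerTemplate=dict(
            size=group_size,
            restartPolicy=restart_policy,
        ),
    )


spec = build_lws_spec(group_size=4, restart_policy="None")
```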
How about failures between groups? Could you compare the failure handling at all levels between LWS and JobSet?
There are no failure policies between groups. Each group is independent, so if one group fails, the others continue running.
axlearn/cloud/gcp/lws_utils.py (outdated)

```python
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
        size=system.vms_per_slice,
```
What's the use case for a leader worker set without a leader?
All LWS use cases still apply to an LWS without a leader; the only difference is that the dual-template feature is not used. I made the generic `TPULeaderWorkerTemplate` a single template to mirror `TPUReplicatedJob`.
Could you also cover `PathwaysMultiheadReplicatedJob`, where we create multiple Pathways cluster replicas at a time?
axlearn/cloud/gcp/lws_utils.py
Outdated
if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS: | ||
raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}") | ||
|
||
def _build_container(self) -> dict: |
@markblee `_build_container` and `_build_pod` should be shareable with the JobSet version. Do you have a preference between extracting them into a parent class and using the modifier pattern?
```diff
@@ -378,3 +378,28 @@ class GCPAPI(str, enum.Enum):
     """GCP API to submit resource requests to."""

     GKE = "GKE"


 def delete_k8s_leaderworkerset(name: str, *, namespace: str):
```
Can you also define `list_k8s_leaderworkerset`? It will be used by some tooling.
By this you mean creating multiple multi-host inference deployments?
I want to make sure N replicas of the Pathways cluster will be created. This differs from JobSet, which uses a replicated job to control the replication. LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created? If this is the case, could you confirm it?
That is correct: if --num_replicas=N, it will create N replicas of the Pathways cluster.
The number of workers is not set by --num_replicas, however; it is determined by the machine type. So for a TPU 4x4 multi-slice, it will create 4 workers (see axlearn/cloud/gcp/lws_utils.py, line 194 at cd3ffe1).
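The replication semantics described above can be sketched as simple arithmetic (the helper name is illustrative):

```python
def expected_pod_counts(*, num_replicas: int, vms_per_slice: int) -> dict:
    # Each replica is one LWS group: one head (leader) pod plus
    # vms_per_slice TPU worker pods, determined by the machine type
    # rather than by --num_replicas.
    return dict(
        heads=num_replicas,
        tpu_workers=num_replicas * vms_per_slice,
    )
```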
What else is needed to merge this PR?
@muyangyuapple or @Ethanlm could you please provide another review? It takes effort to keep a large PR open because main changes frequently; right now this branch has conflicts with main. After your approval, we'll also need approval from Mark.
Haven't finished my review yet; left some initial minor comments.
Can you please provide some concrete test examples in the PR summary and demonstrate what an LWS TPU job and an LWS Pathways job would look like on k8s?
For example: what services and pods are created on k8s, what the naming convention looks like, and what env variables or annotations are added automatically by the LWS controller?
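For reference, a hedged sketch of the pod-naming convention the upstream LWS controller uses (leader pod `<name>-<group>`, worker pods `<name>-<group>-<index>` with workers starting at index 1), which may help when writing such examples:

```python
def lws_pod_name(lws_name: str, group_index: int, worker_index: int = 0) -> str:
    # Per the upstream LWS convention: worker_index 0 denotes the leader pod,
    # which carries no worker suffix; workers are numbered from 1.
    if worker_index == 0:
        return f"{lws_name}-{group_index}"
    return f"{lws_name}-{group_index}-{worker_index}"
```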
```diff
@@ -556,3 +565,147 @@ def __call__(self) -> Sequence[Nested[Any]]:
         )

         return replicated_jobs


 class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
```
We also need to set these env vars in this builder: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L240-L253
The address of the leader is injected into all the containers; do we still need it?
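For context: the upstream LWS controller injects the leader's address into every container in the group via the `LWS_LEADER_ADDRESS` environment variable. A minimal sketch of reading it (the function name and port are hypothetical placeholders, not the actual Pathways values):

```python
import os


def pathways_head_address(port: int = 29000) -> str:
    # LWS_LEADER_ADDRESS is injected by the LWS controller into every
    # container in the group; "localhost" is only a local-testing fallback.
    leader = os.environ.get("LWS_LEADER_ADDRESS", "localhost")
    return f"{leader}:{port}"
```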
```python
from axlearn.common.test_utils import TestCase


class TPULeaderWorkerTemplateTest(TestCase):
```
This is removed now, right?
```diff
@@ -87,8 +88,11 @@ def create_for(self, job: GKEJob):

     # TODO(markblee,ethanli,muyang_yu): Refactor so we do not need to make assumptions about
     # TPUGKEJob implementation and internals.
-    if not isinstance(builder_cfg, TPUReplicatedJob.Config):
+    if not isinstance(builder_cfg, (TPUReplicatedJob.Config, BaseLeaderWorkerTemplate.Config)):
         raise TypeError(f"Expected {TPUReplicatedJob.Config}, got {type(builder_cfg)}.")
```
s/BaseLeaderWorkerTemplate/PathwaysLeaderWorkerTemplate
Please also make a similar change at line 184 in the `delete_for` method.
```python
with mock_gcp_settings([lws_utils.__name__, bundler.__name__]):
    fv = flags.FlagValues()
    cfg = pathways_utils.PathwaysLeaderWorkerTemplate.default_config().set(
        inner=lws_utils.TPULeaderWorkerTemplate.default_config()
```
We shouldn't have `inner` anymore.
```python
self.assertIsNotNone(cfg.name)
self.assertEqual(cfg.cluster, cluster or self._settings["gke_cluster"])
self.assertEqual(cfg.enable_pre_provisioner, enable_pre_provisioner)
builder_cfg: TPULeaderWorkerTemplate.Config = cfg.inner.builder
```
Please update all the `TPULeaderWorkerTemplate` references to the new class.
axlearn/cloud/gcp/pathways_utils.py (outdated)

```python
pod_spec["nodeSelector"].update(
    {
        _PATHWAYS_HEAD_NODE_POOL_SELECTOR_KEY: _PATHWAYS_HEAD_NODE_POOL_SELECTOR_VALUE,
    }
)
```
As discussed offline, we use TPU pod as the head pod in this version. So we should use the normal TPU node selector.
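A sketch of the suggested change using the standard GKE TPU node-selector labels (`cloud.google.com/gke-tpu-accelerator`, `cloud.google.com/gke-tpu-topology`); the helper name and example values below are placeholders:

```python
def tpu_node_selector(*, accelerator: str, topology: str) -> dict:
    # Standard GKE TPU node pool labels, so the head pod schedules onto a
    # normal TPU node pool instead of a dedicated head node pool.
    return {
        "cloud.google.com/gke-tpu-accelerator": accelerator,
        "cloud.google.com/gke-tpu-topology": topology,
    }
```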
```diff
@@ -81,7 +81,8 @@ def running_from_vm() -> bool:
         capture_output=True,
         text=True,
     )
-    return (out.returncode == 0) and "Metadata-Flavor: Google" in out.stdout
+    return False
```
Please revert this change?
Co-authored-by: Meng (Ethan) Li <[email protected]>
Force-pushed from bcf4269 to c659c50.
Added integration with https://github.com/kubernetes-sigs/lws for TPUs, as well as integration of LWS + Pathways.
To run a basic LWS+TPU job:
To run an LWS+Pathways job: