Adding LWS Integration #1174
Conversation
axlearn/cloud/gcp/lws_utils.py (outdated)

```python
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
```
What's the retry policy of LWS? Could you help me understand what happens when:
- the leader fails or is preempted?
- a worker fails or is preempted?
The default behavior is that if any pod in the group fails, regardless of whether it is a leader or a worker, the whole group fails and is restarted. LWS also supports not restarting the whole group by setting `RestartPolicy: None`.
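A minimal sketch of how this could surface in the spec dict the builder emits — the `leaderWorkerTemplate.size`/`restartPolicy` field names follow the upstream LeaderWorkerSet v1 API, while the helper name is illustrative:

```python
def build_lws_spec(*, group_size: int, restart_policy: str = "RecreateGroupOnPodRestart") -> dict:
    """Builds a LeaderWorkerSet spec fragment (illustrative helper).

    restart_policy:
      "RecreateGroupOnPodRestart" - any pod failure restarts the whole group (default).
      "None" - failed pods are restarted individually; the group is not recreated.
    """
    return dict(
        leaderWorkerTemplate=dict(
            size=group_size,
            restartPolicy=restart_policy,
        ),
    )


spec = build_lws_spec(group_size=4, restart_policy="None")
```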
How about failures between groups? Could you compare the failure handling at all levels between LWS and JobSet?
There are no failure policies between groups. Each group is independent, so if one group fails, the others continue running.
axlearn/cloud/gcp/lws_utils.py (outdated)

```python
def __call__(self) -> Nested[Any]:
    system = USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS[self._tpu_type]
    return dict(
        size=system.vms_per_slice,
```
What's the use case for a leader worker set without a leader?
All LWS use cases still apply to an LWS without a leader; the only difference is that the dual-template feature is not used. I made the generic `TPULeaderWorkerTemplate` a single template to mirror `TPUReplicatedJob`.
Could you also cover `PathwaysMultiheadReplicatedJob`, where we create multiple Pathways cluster replicas at a time?
axlearn/cloud/gcp/lws_utils.py
Outdated
if self._tpu_type not in USER_FACING_NAME_TO_SYSTEM_CHARACTERISTICS: | ||
raise NotImplementedError(f"Missing system characteristics for {self._tpu_type}") | ||
|
||
def _build_container(self) -> dict: |
@markblee `_build_container` and `_build_pod` should be shareable with the JobSet version. Do you have a preference between extracting them into a parent class and using the modifier pattern?
```diff
@@ -378,3 +378,28 @@ class GCPAPI(str, enum.Enum):
     """GCP API to submit resource requests to."""

     GKE = "GKE"


 def delete_k8s_leaderworkerset(name: str, *, namespace: str):
```
Can you also define `list_k8s_leaderworkerset`? It will be used by some tooling.
By this you mean creating multiple multi-host inference deployments?
I want to make sure N replicas of the Pathways cluster will be created. This differs from JobSet, which uses a replicated job to control the replication. LWS replicates a group as a whole, right? E.g., if you set --num_replicas=N, then N head nodes and N TPU worker groups will be created? If this is the case, could you confirm it?
That is correct: if --num_replicas=N, it will create N replicas of the Pathways cluster.
The number of workers is not set by --num_replicas, however; it is determined by the machine type. So for a TPU 4x4 multi-slice, it will create 4 workers (see axlearn/cloud/gcp/lws_utils.py, line 194 at cd3ffe1).
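The replication semantics described above can be sketched as simple arithmetic (the helper name is illustrative):

```python
def expected_pod_counts(*, num_replicas: int, vms_per_slice: int) -> dict:
    # Each replica is one LWS group: one head (leader) pod plus
    # vms_per_slice TPU worker pods, determined by the machine type
    # rather than by --num_replicas.
    return dict(
        heads=num_replicas,
        tpu_workers=num_replicas * vms_per_slice,
    )
```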
What else is needed to merge this PR?
@muyangyuapple or @Ethanlm could you please provide another review? It takes effort to keep a large PR open because main changes frequently; right now this branch has conflicts with main. After your approval, we'll also need approval from Mark.
Haven't finished my review yet; left some initial minor comments.
Can you please provide some concrete test examples in the PR summary and demonstrate what an LWS TPU job and an LWS Pathways job would look like on k8s?
For example: what services and pods are created on k8s, what the naming convention looks like, and what env variables or annotations are added automatically by the LWS controller?
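For reference, a hedged sketch of the pod-naming convention the upstream LWS controller uses (leader pod `<name>-<group>`, worker pods `<name>-<group>-<index>` with workers starting at index 1), which may help when writing such examples:

```python
def lws_pod_name(lws_name: str, group_index: int, worker_index: int = 0) -> str:
    # Per the upstream LWS convention: worker_index 0 denotes the leader pod,
    # which carries no worker suffix; workers are numbered from 1.
    if worker_index == 0:
        return f"{lws_name}-{group_index}"
    return f"{lws_name}-{group_index}-{worker_index}"
```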
```diff
@@ -556,3 +565,147 @@ def __call__(self) -> Sequence[Nested[Any]]:
         )

         return replicated_jobs


 class PathwaysLeaderWorkerTemplate(BaseLeaderWorkerTemplate):
```
We also need to set these env vars in this builder: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L240-L253
The address of the leader is injected into all the containers; do we still need it?
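For context: the upstream LWS controller injects the leader's address into every container in the group via the `LWS_LEADER_ADDRESS` environment variable. A minimal sketch of reading it (the function name and port are hypothetical placeholders, not the actual Pathways values):

```python
import os


def pathways_head_address(port: int = 29000) -> str:
    # LWS_LEADER_ADDRESS is injected by the LWS controller into every
    # container in the group; "localhost" is only a local-testing fallback.
    leader = os.environ.get("LWS_LEADER_ADDRESS", "localhost")
    return f"{leader}:{port}"
```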
```python
from axlearn.common.test_utils import TestCase


class TPULeaderWorkerTemplateTest(TestCase):
```
This is removed now, right?
```diff
@@ -87,8 +88,11 @@ def create_for(self, job: GKEJob):

     # TODO(markblee,ethanli,muyang_yu): Refactor so we do not need to make assumptions about
     # TPUGKEJob implementation and internals.
-    if not isinstance(builder_cfg, TPUReplicatedJob.Config):
+    if not isinstance(builder_cfg, (TPUReplicatedJob.Config, BaseLeaderWorkerTemplate.Config)):
         raise TypeError(f"Expected {TPUReplicatedJob.Config}, got {type(builder_cfg)}.")
```
s/BaseLeaderWorkerTemplate/PathwaysLeaderWorkerTemplate
Please also make a similar change at line 184 in the `delete_for` method.
```python
with mock_gcp_settings([lws_utils.__name__, bundler.__name__]):
    fv = flags.FlagValues()
    cfg = pathways_utils.PathwaysLeaderWorkerTemplate.default_config().set(
        inner=lws_utils.TPULeaderWorkerTemplate.default_config()
```
We shouldn't have `inner` anymore.
```python
self.assertIsNotNone(cfg.name)
self.assertEqual(cfg.cluster, cluster or self._settings["gke_cluster"])
self.assertEqual(cfg.enable_pre_provisioner, enable_pre_provisioner)
builder_cfg: TPULeaderWorkerTemplate.Config = cfg.inner.builder
```
Please update all the `TPULeaderWorkerTemplate` references to the new class.
axlearn/cloud/gcp/pathways_utils.py (outdated)

```python
pod_spec["nodeSelector"].update(
    {
        _PATHWAYS_HEAD_NODE_POOL_SELECTOR_KEY: _PATHWAYS_HEAD_NODE_POOL_SELECTOR_VALUE,
    }
)
```
As discussed offline, we use TPU pod as the head pod in this version. So we should use the normal TPU node selector.
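A sketch of the suggested change using the standard GKE TPU node-selector labels (`cloud.google.com/gke-tpu-accelerator`, `cloud.google.com/gke-tpu-topology`); the helper name and example values below are placeholders:

```python
def tpu_node_selector(*, accelerator: str, topology: str) -> dict:
    # Standard GKE TPU node pool labels, so the head pod schedules onto a
    # normal TPU node pool instead of a dedicated head node pool.
    return {
        "cloud.google.com/gke-tpu-accelerator": accelerator,
        "cloud.google.com/gke-tpu-topology": topology,
    }
```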
```diff
@@ -81,7 +81,8 @@ def running_from_vm() -> bool:
         capture_output=True,
         text=True,
     )
-    return (out.returncode == 0) and "Metadata-Flavor: Google" in out.stdout
+    return False
```
Please revert this change?
Co-authored-by: Meng (Ethan) Li <[email protected]>
Force-pushed from bcf4269 to c659c50.
Added integration with https://github.com/kubernetes-sigs/lws for TPUs, as well as integration of LWS + Pathways.
To run a basic LWS+TPU job:
To run an LWS+Pathways job: