Adding basic elastic training (pause-and-resume) #1256

lukebaumann · 2025-06-12T21:37:49Z

Elastic training with Pathways on Cloud allows jobs to react to failures without a full restart.

The elastic training mode in this PR is one form of Fast Resume: the main run is wrapped in a while loop and a try-except block so that, if a slice is lost mid-run, the run waits for the lost slice to rejoin the Pathways cluster and then recovers from the most recent checkpoint.

This type of elastic training re-initializes much of the run and relies on checkpoints for recovery. Its primary benefit is that a failure no longer requires restarting the entire JobSet; only a single slice's worth of pods is restarted.
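
The control flow described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `SliceDownError`, `FakeElasticManager`, and `run_with_pause_and_resume` are hypothetical stand-ins. Only the method names `wait_for_slices` (with a `timeout` kwarg) and `is_error_due_to_slice_down` come from the discussion, and their real signatures in pathwaysutils may differ.

```python
from typing import Callable


class SliceDownError(RuntimeError):
    """Hypothetical stand-in for the runtime error raised when a slice is lost."""


class FakeElasticManager:
    """Hypothetical stand-in for the elastic manager; it only counts waits."""

    def __init__(self) -> None:
        self.wait_calls = 0

    def is_error_due_to_slice_down(self, error: Exception) -> bool:
        return isinstance(error, SliceDownError)

    def wait_for_slices(self, timeout: float) -> None:
        # A real manager would block here until the lost slice rejoins
        # (or the timeout expires). The fake just records the call.
        self.wait_calls += 1


def run_with_pause_and_resume(
    run_once: Callable[[], str],
    elastic_manager: FakeElasticManager,
    timeout_s: float = 10 * 60,
) -> str:
    """Calls run_once until it succeeds, pausing whenever a slice goes down."""
    while True:
        try:
            # run_once is assumed to rebuild the trainer and restore from the
            # latest checkpoint each time it is called.
            return run_once()
        except RuntimeError as error:
            if not elastic_manager.is_error_due_to_slice_down(error):
                raise  # Unrelated failures are not retried.
            elastic_manager.wait_for_slices(timeout=timeout_s)


# Usage: a run that loses a slice twice and then completes.
attempts = {"n": 0}


def flaky_run() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SliceDownError("slice lost")
    return "done"


mgr = FakeElasticManager()
result = run_with_pause_and_resume(flaky_run, mgr)
```

Returning from inside the loop plays the same role as the `break` that a later commit adds so that a successful run exits the loop.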

Ethanlm · 2025-06-17T18:08:26Z

pyproject.toml

+pathways-tpu = [
+    "axlearn[gcp]",
+    "jax==0.5.3",  # must be >=0.4.19 for compat with v5p.
+    "pathwaysutils @ git+https://github.com/AI-Hypercomputer/pathways-utils",  # For JAX+Pathways single-controller accelerator coordinator.


Will 0.1.1 not work for the pause-and-resume use case?

The Manager.wait_for_slices was added recently. I will cut a new release of pathwaysutils that includes it. Before I do that, I need to ensure there is not a newly introduced dependency on jax>0.5.3 within pathwaysutils.

This change was for verification.

0.1.1 will work if we add a similar wait_for_slices to axlearn (as has been done in MaxText for the last couple of months), but I would rather not do that.

Ethanlm

Could you include in the PR summary a demonstration that validates that the pause-and-resume feature works?

Thanks

Ethanlm · 2025-06-17T18:09:28Z

axlearn/common/launch_trainer_main.py

-    trainer_config.set(recorder=config_for_function(lambda: measurement.global_recorder))
-    measurement.start_monitoring()
-    launch_trainer.run_trainer(trainer_config)
+    elastic_manager = manager.Manager()


Can we make it enabled only in pathways environment?

Looks like we need pathwaysutils.initialize() before calling this

shauryagup · 2025-06-17T20:54:39Z

Note that we also need to set elasticity flags on the proxy server inside the jobset YAML for the complete e2e test to work

lukebaumann · 2025-06-17T21:15:25Z

FYI this is a draft PR and I am still verifying functionality. At this point for this flavor of fast-resume, there should not be any major additional code changes.

Added guards to only use fast-resume if the proxy backend is used.
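
A guard of this kind might look like the sketch below. It is an assumption-laden illustration, not the code in this PR: `maybe_create_elastic_manager` is a hypothetical helper. The backend name `"proxy"` and the module path `pathwaysutils.elastic.manager` are taken from the diff in this thread. The import is deferred so that non-Pathways environments never need pathwaysutils installed.

```python
import importlib
import sys
import types
from typing import Any, Optional


def maybe_create_elastic_manager(
    jax_backend: str,
    module_name: str = "pathwaysutils.elastic.manager",
) -> Optional[Any]:
    """Returns an elastic manager only when running on the proxy backend.

    Hypothetical helper: on any other backend it returns None without
    importing the elastic module, so the caller can take the ordinary
    non-elastic path.
    """
    if jax_backend != "proxy":
        return None
    # Deferred import: only attempted in a Pathways (proxy) environment.
    module = importlib.import_module(module_name)
    return module.Manager()


# Usage with a fake module, so the sketch runs without pathwaysutils.
fake_module = types.ModuleType("fake_elastic_manager")
fake_module.Manager = lambda: "fake-manager"
sys.modules["fake_elastic_manager"] = fake_module

on_tpu = maybe_create_elastic_manager("tpu")
on_proxy = maybe_create_elastic_manager("proxy", module_name="fake_elastic_manager")
```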


apghml · 2025-06-19T02:03:29Z

axlearn/common/launch_trainer.py

+            except jax.errors.JaxRuntimeError as error:
+                if not elastic_manager.is_error_due_to_slice_down(error):
+                    raise
+                ten_minutes = 10 * 60


Why ten minutes?

apghml · 2025-06-19T02:03:36Z

axlearn/common/launch_trainer.py

-    trainer: SpmdTrainer = trainer_config.instantiate(parent=None)
-    prng_key = jax.random.PRNGKey(seed=FLAGS.trainer_prng_seed)
-    output = trainer.run(prng_key)
+    if False and FLAGS.jax_backend == "proxy":


Why is this disabled?

apghml · 2025-06-19T02:05:12Z

axlearn/common/launch_trainer.py

+        # pylint: disable-next=import-error,import-outside-toplevel
+        from pathwaysutils.elastic import manager
+        elastic_manager = manager.Manager()
+        while True:


Do we want a max number of failed retries before we terminate? (E.g., if every retry terminates with an error in <1hr more than 5 times in a row, fail, or something?)
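
One way such a limit could work is sketched below. This is a hypothetical illustration of the suggestion, not something the PR implements: `run_with_retry_budget`, its parameters, and the injected `clock` are all invented names, and the thresholds (5 failures, 1 hour) simply mirror the example figures in the question.

```python
import time
from typing import Callable


def run_with_retry_budget(
    run_once: Callable[[], str],
    is_recoverable: Callable[[Exception], bool],
    max_quick_failures: int = 5,
    quick_failure_s: float = 60 * 60,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Retries run_once, but gives up after too many quick failures in a row."""
    quick_failures = 0
    while True:
        started = clock()
        try:
            return run_once()
        except RuntimeError as error:
            if not is_recoverable(error):
                raise
            if clock() - started < quick_failure_s:
                quick_failures += 1
            else:
                # The attempt ran long enough to count as progress; reset.
                quick_failures = 0
            if quick_failures > max_quick_failures:
                raise RuntimeError(
                    f"Giving up after {quick_failures} consecutive quick failures"
                ) from error


# Usage: a run that always fails immediately exhausts the budget.
calls = []


def always_fails() -> str:
    calls.append(1)
    raise RuntimeError("slice down")


try:
    run_with_retry_budget(always_fails, is_recoverable=lambda e: True)
    outcome = "finished"
except RuntimeError as e:
    outcome = str(e)
```

Resetting the counter after a long-running attempt keeps occasional slice losses in a multi-day job from eventually tripping the limit.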

apghml · 2025-06-19T02:06:52Z

Could you include in the PR summary a demonstration that validates that the pause-and-resume feature works?

Thanks

Is there an internal PR for this? (If so, could you DM me the link internally?) Maybe Luke can post a job link there?

Ethanlm · 2025-06-20T07:33:32Z

Could you include in the PR summary a demonstration that validates that the pause-and-resume feature works?
Thanks

Is there an internal PR for this? (If so, could you DM me the link internally?) Maybe Luke can post a job link there?

I was able to verify that this feature works. I will Slack you an internal doc.

lukebaumann requested review from ruomingp, markblee and a team as code owners June 12, 2025 21:37

lukebaumann force-pushed the elastic branch from 2958d88 to a96c62b Compare June 16, 2025 21:38

Ethanlm reviewed Jun 17, 2025


lukebaumann marked this pull request as draft June 17, 2025 21:12

lukebaumann added 5 commits June 17, 2025 22:38

Adding basic elastic training

4a87c9a

Using the timeout kwarg for wait_for_slices

ec1ae49

Adding a pathways-tpu optional dependency kind

d568ad1

Moved elastic logic from launch_trainer_main to launch_trainer.

d850d97

Added guards to only use fast-resume if the proxy backend is used.

Added the changes to the jobset for elastic training

b75b5c6

lukebaumann force-pushed the elastic branch from 4d33b9c to b75b5c6 Compare June 17, 2025 22:38

lukebaumann added 7 commits June 17, 2025 22:57

Made some changes that I expect to be necessary like moving the instantiation of the trainer and creation of the PRNGKey inside the elastic loop

3c7956f

Updated some comments in pyproject.toml

c707f3a

Adding a break so that a successful run exits the run loop.

e7cc63c

Adding a pathways-tpu docker image

9f08b3c

Temporary changes to the configuration to decrease batch size and increase checkpoint frequency

168e41e

Adding back some code to the non-elastic path

1af5203

Use the public API for is_error_due_to_slice_down from pathwaysutils

8da2691

apghml reviewed Jun 19, 2025


Ethanlm changed the title ~~Adding basic elastic training~~ Adding basic elastic training (pause-and-resume) Jun 20, 2025
