add a new DAG 'maxtext_multi_tier_sav01_save_local' #770
+530
−0
Description
This test verifies the local save function of Orbax multi-tier checkpointing with the phase 2 replicator enabled. It performs the following tasks:
Run MaxText training with checkpointing enabled until the local checkpoints have been saved.
Clean up the saved local RAM checkpoints.
Use the logging explorer API to query the log entries produced during the test, and verify that they include the messages reporting that local checkpoints were saved.
Note: The local checkpoints are validated through the logging explorer API rather than by going into the pod and inspecting the saved checkpoint files.
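The log-based validation described above can be sketched as follows. This is a minimal illustration, not the code added in this PR: the message pattern `SAVED_PATTERN` and the helper names `saved_steps` and `verify_local_saves` are hypothetical, and the actual log text emitted during a local save may differ.

```python
import re
from typing import Iterable, List

# Hypothetical pattern: the real "checkpoint saved" log message may be
# worded differently and should be taken from an actual run.
SAVED_PATTERN = re.compile(
    r"Finished saving local checkpoint.*step[ =](\d+)", re.IGNORECASE
)


def saved_steps(log_lines: Iterable[str], pattern=SAVED_PATTERN) -> List[int]:
    """Return the sorted, de-duplicated steps for which a save was logged."""
    steps = set()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)


def verify_local_saves(log_lines: Iterable[str], expected_steps: Iterable[int]) -> bool:
    """True if every expected step has a matching 'saved' log entry."""
    return set(expected_steps) <= set(saved_steps(log_lines))


if __name__ == "__main__":
    sample = [
        "Finished saving local checkpoint at step 20",
        "some unrelated training log line",
        "finished saving local checkpoint, step=40",
    ]
    print(saved_steps(sample))
    print(verify_local_saves(sample, [20, 40]))
```

In a real run, `log_lines` would presumably come from a Cloud Logging query scoped to the cluster, the workload pods, and the time window of the test, rather than from a hard-coded list.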
Tests
cluster env:
  project: cloud-tpu-multipod-dev
  name: b425674043-v5p64
  zone: europe-west4-b
  highScaleCheckpointingConfig: enabled
  GcsFuseCsiDriver: enabled
  Workload Identity Federation: enabled
gcs bucket:
  name: mtc-automation-bucket
  Hierarchical namespace: enabled
airflow/composer env:
  project: cloud-ml-auto-solutions
  name: erniechang-test
  composer version: 2.13.4
  airflow version: 2.10.5
test result:
The tests were run in the "cloud-ml-auto-solutions" project using Cloud Composer.
Test link
Checklist
Before submitting this PR, please make sure (put X in square brackets):