add a new DAG 'maxtext_multi_tier_sav01_save_local' #770

Status: Open, wants to merge 1 commit into base branch master

ernie-chang

Description

This test verifies the local save function of Orbax multi-tier checkpointing with the phase 2 replicator enabled. It performs the following tasks:

  • Run MaxText training with checkpointing enabled until the local checkpoints have been saved.
  • Clean up the saved local RAM checkpoints.
  • Query the log entries emitted during the test through the logging explorer API, and verify that they include the entries reporting that local checkpoints were saved.

Note: Local checkpoints are validated through the logging explorer API rather than by entering the pod and inspecting the saved checkpoint files.
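
The log-based validation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `build_log_filter` and `find_saved_steps` are hypothetical, and `HYPOTHETICAL_SAVE_PATTERN` is a placeholder message format, not the actual text that Orbax or MaxText logs. The filter uses standard Cloud Logging query fields for GKE container logs and assumes the workload logs land under the `k8s_container` resource type.

```python
import re
from datetime import datetime, timezone
from typing import Iterable, List

# Placeholder only: the real "checkpoint saved" log text is an assumption here
# and must be replaced with the message the workload actually emits.
HYPOTHETICAL_SAVE_PATTERN = r"local checkpoint saved at step (\d+)"


def _rfc3339(ts: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_log_filter(
    project_id: str, cluster_name: str, start: datetime, end: datetime
) -> str:
    """Build a Cloud Logging filter for container logs in a time window."""
    clauses = [
        'resource.type="k8s_container"',
        f'resource.labels.project_id="{project_id}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'timestamp>="{_rfc3339(start)}"',
        f'timestamp<="{_rfc3339(end)}"',
    ]
    return " AND ".join(clauses)


def find_saved_steps(
    messages: Iterable[str], pattern: str = HYPOTHETICAL_SAVE_PATTERN
) -> List[int]:
    """Return the sorted, de-duplicated step numbers found in log messages."""
    regex = re.compile(pattern)
    steps = set()
    for message in messages:
        match = regex.search(message)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)
```

In a real task, the filter string would presumably be passed to a Cloud Logging client (for example, `list_entries(filter_=...)` in the `google-cloud-logging` Python library), and the test would fail if `find_saved_steps` returned an empty list for the training window.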

Tests

cluster env:

  • project: cloud-tpu-multipod-dev
  • name: b425674043-v5p64
  • zone: europe-west4-b
  • highScaleCheckpointingConfig: enabled
  • GcsFuseCsiDriver: enabled
  • Workload Identity Federation: enabled

gcs bucket:

  • name: mtc-automation-bucket
  • Hierarchical namespace: enabled

airflow/composer env:

  • project: cloud-ml-auto-solutions
  • name: erniechang-test
  • composer version: 2.13.4
  • airflow version: 2.10.5

test result:

The tests were run in the "cloud-ml-auto-solutions" project with Cloud Composer.
Test link

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

[UPDATE] Update with comment

[NEW] Add orbax utils: apply cpc and delete cpc to the dag

[UPDATE] add delete cpc and apply cpc in orbax

[UPDATE] Update format and coding style

[UPDATE]

[UPDATE] Modify the dag structure and add cpc operation

[UPDATE] Modify the dag structure and add cpc operation

[UPDATE] update for the code comment

[UPDATE] update for the code comment