An approach to sync between the nodes during the checkpoint copy activity #14
base: main
I closed the previous PR #13 as it was using an old approach.
nemo/launch.sh
Outdated
@@ -55,19 +58,20 @@ export ADDITIONAL_ARGS="++model.micro_batch_size=$MICRO_BATCH ++trainer.max_step
 # == construct job launch command ==

 # create base job launch command
-export LAUNCH_CMD="git clone https://github.com/hosseinsarshar/dist-training-vertex.git &&"
+export LAUNCH_CMD="git clone -b sync-copy-mpi https://github.com/hosseinsarshar/dist-training-vertex.git &&"
Just need to remove the branch reference (`-b sync-copy-mpi`).
fixed in this commit e0b885d
pushed these by accident i think
yes - removed them in this commit 8920249
just a few things but looks really good overall!
nemo/utils/model_copy.sh
Outdated
@@ -0,0 +1,31 @@
+#!/bin/bash
I don't know why, but it might be better to give the Python script and the bash script different names haha - I had to do a double take.
Things to expect in this PR:
- `nemo/utils/model_copy.sh` that launches `model_copy.py`
- a distributed approach to sync between ranks when rank 0 is copying the checkpoint
- `dist.barrier()` to sync between the nodes - I set a manual timeout so the job fails if the sync takes longer than 30 min, as it might otherwise get stuck forever.

Link to a test job: nemo_llama3-70b_continual-pretraining_8
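
For reviewers who want the sync pattern in one place, here is a minimal sketch of "rank 0 copies, everyone waits at a barrier, fail after 30 minutes". It is not the code in `model_copy.py`: the function names, the `gloo` backend, the `env://` init method, and the local `shutil.copytree` copy are all assumptions made for illustration. Only `dist.barrier()` and the 30-minute limit come from the description above.

```python
import datetime
import shutil
from pathlib import Path
from typing import Callable, Optional

# 30 minutes, matching the manual timeout mentioned in the PR description.
SYNC_TIMEOUT = datetime.timedelta(minutes=30)


def init_distributed(rank: int, world_size: int) -> None:
    """Join the process group with a bounded timeout (hypothetical helper)."""
    # Imported lazily so the copy helper below can be exercised without torch.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",        # assumption: CPU-side sync; the PR may use another backend
        init_method="env://",  # assumption: MASTER_ADDR / MASTER_PORT are set in the environment
        rank=rank,
        world_size=world_size,
        timeout=SYNC_TIMEOUT,  # collectives such as barrier() raise instead of hanging forever
    )


def copy_checkpoint_then_sync(
    rank: int,
    src: Path,
    dst: Path,
    barrier: Optional[Callable[[], None]] = None,
) -> None:
    """Rank 0 copies the checkpoint; every rank then waits at the barrier."""
    if barrier is None:
        import torch.distributed as dist

        barrier = dist.barrier

    if rank == 0:
        # Hypothetical copy step: a plain local directory copy stands in for
        # whatever transfer the real script performs.
        shutil.copytree(src, dst, dirs_exist_ok=True)

    # No rank proceeds until rank 0 has finished copying.
    barrier()
```

Each rank would call `init_distributed(rank, world_size)` once and then `copy_checkpoint_then_sync(rank, src, dst)`. The `barrier` parameter exists only so the helper can be tested with a stub. With the NCCL backend, whether the process-group timeout is actually enforced can depend on the PyTorch version and its error-handling settings, so that is worth checking against the version used in the job.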