Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An approach to sync between the nodes during the checkpoint copy activity #14

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

hosseinsarshar
Copy link
Owner

@hosseinsarshar hosseinsarshar commented Sep 4, 2024

Things to expect in this PR:

  • Added a new file: nemo/utils/model_copy.sh that launches model_copy.py - a distributed approach to sync between ranks when rank 0 is copying the checkpoint
  • The model_copy.py script is responsible to copy the checkpoint on rank 0 and wait on other ranks
  • fixes to the launch script to account for this checkpoint - one key difference with other approach is the lack of the need for a centralized storage to have the dummy file - it uses the dist.barrier() to sync between the nodes - I set a manual timeout to fail if it takes longer than 30 min as it might get stuck for ever.
  • README - a few fixes on the readme file

Link to a test job: nemo_llama3-70b_continual-pretraining_8

@hosseinsarshar hosseinsarshar changed the title Sync copy mpi An approach to sync between the nodes during the checkpoint copy activity Sep 4, 2024
@hosseinsarshar
Copy link
Owner Author

I closed the previous PR #13 as it was using an old approach.

nemo/launch.sh Outdated
@@ -55,19 +58,20 @@ export ADDITIONAL_ARGS="++model.micro_batch_size=$MICRO_BATCH ++trainer.max_step
# == construct job launch command ==

# create base job launch command
export LAUNCH_CMD="git clone https://github.com/hosseinsarshar/dist-training-vertex.git &&"
export LAUNCH_CMD="git clone -b sync-copy-mpi https://github.com/hosseinsarshar/dist-training-vertex.git &&"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just need to remove the branch reference

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in this commit e0b885d

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed these by accident i think

Copy link
Owner Author

@hosseinsarshar hosseinsarshar Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes - removed them in this commit 8920249

Copy link
Collaborator

@bvrockwell bvrockwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few things but looks really good overall!

@@ -0,0 +1,31 @@
#!/bin/bash
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why, but it might be better to name the python script and bash script differently haha - i had to double take.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants