Enable distributed training test with NCCL on ALPS (GH200) #689
Conversation
Yes, the current tests are only intended to check the metatrain side of distributed training, and we use soap-bpnn as a "simple" model for this. We would have to add distributed tests to the relevant architectures.
This looks good, thanks!
Co-authored-by: Guillaume Fraux <[email protected]>
cscs-ci run
Thanks a lot!
This PR builds on #687 and runs the distributed training example in `tests/distributed`. The example is run on a single node, with all 4 GPUs.
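As a rough illustration of that setup, here is a minimal sketch (not the actual code in `tests/distributed`) of a single-node, multi-GPU NCCL run; the worker function, port number, and the printed all-reduce are placeholders chosen for this example:

```python
# Minimal sketch: spawn one process per GPU on a single node and run a NCCL
# all-reduce, mirroring the single-node / 4-GPU setup of the example.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings for a single-node run (placeholder port).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank contributes its rank index; after the all-reduce every rank
    # holds the sum over all ranks (0 + 1 + 2 + 3 = 6 on a 4-GPU node).
    payload = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(payload)
    print(f"rank {rank}: {payload.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 on a GH200 node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```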
The CI workflow is now the following: the base container (used by the `tox` tests) is only re-built if the Containerfile changes.

This PR also adds `torch.distributed.destroy_process_group()` to the different trainers to get rid of the corresponding warnings.

This is a starting point, but we should discuss how to expand the distributed training tests (currently the example only exercises SOAP-BPNN).
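The cleanup added to the trainers can be sketched roughly as follows; the `Trainer` class and the `distributed` flag here are hypothetical placeholders, not metatrain's actual trainer API:

```python
# Hypothetical sketch of the cleanup pattern: tear down the process group
# once training finishes, so torch no longer warns about a process group
# that was never destroyed.
import torch.distributed as dist


class Trainer:  # placeholder name, not the real metatrain trainer
    def train(self, distributed: bool = False) -> None:
        if distributed:
            # Assumes rank/world size are provided via the environment
            # (e.g. by the launcher), as with the env:// init method.
            dist.init_process_group(backend="nccl")

        # ... training loop ...

        if distributed and dist.is_initialized():
            dist.destroy_process_group()
```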
EDIT: For the distributed training tests I built on top of the base container; I have not yet explored the performance of this solution.
📚 Documentation preview 📚: https://metatrain--689.org.readthedocs.build/en/689/