As far as I understand, the TorchFT integration in TorchTitan will enable HSDP to be fault tolerant, i.e., some replica groups can fail and be restarted while the optimizer keeps stepping with the averaged gradients of the remaining working replica groups. From the config, I understand that when TorchFT is enabled, we would replace […]. However, running the second command gives the error […]. Can anyone explain how to properly enable TorchFT with TorchTitan? @fegin, tagging you since you landed this feature. Thanks in advance.
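While waiting for official docs, here is a rough sketch of how a TorchFT-enabled run might be launched. The `torchft_lighthouse` binary and the `TORCHFT_LIGHTHOUSE` variable come from the torchft README; the `--fault_tolerance.*` flag names, the port, and the config path are assumptions and may not match the current TorchTitan CLI:

```shell
# Assumption: torchft runs a "lighthouse" server that tracks quorum across
# replica groups; start it once, reachable from every replica group.
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --join_timeout_ms 10000 &

# Assumption: each HSDP replica group is its own torchrun job with a unique
# replica id; flag names below are hypothetical placeholders for whatever
# torchtitan exposes once the feature is documented.
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc_per_node=8 \
    -m torchtitan.train --job.config_file ./train_configs/llama3_8b.toml \
    --fault_tolerance.enable \
    --fault_tolerance.replica_id=0 \
    --fault_tolerance.group_size=2
```

The intended behavior, as I understand it: if one replica group dies, the lighthouse drops it from the quorum, the surviving groups keep stepping with their averaged gradients, and the failed group rejoins after restart.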
I will draft a document on using TorchFT.
#993