
torch_run.py lacking autocast and scaling for Automatic Mixed Precision #45

Open
bhavnicksm opened this issue May 9, 2024 · 1 comment

Comments

@bhavnicksm

Hey,

As mentioned in the title, the model is cast directly to BF16, without the torch.amp machinery (autocast and gradient scaling) needed for Automatic Mixed Precision.

This means the projected memory shown here covers only the 2 bytes per parameter for the BF16 model weights, but training purely in BF16 tends to give poor results according to various sources. To make it work properly you would need AMP, which costs roughly 6 bytes per parameter (2 bytes for the BF16 working copy plus 4 bytes for the FP32 master weights), which blows the 24 GiB mentioned in the paper out of the water.

For LLaMA 3 8B, you would need 8 * 10^9 * 6 bytes ≈ 44 GiB just to load the parameters under BF16 AMP.
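For a quick sanity check, here is the back-of-the-envelope arithmetic behind that figure (the 8B parameter count is rounded, and the 2 + 4 byte split assumes BF16 working weights plus an FP32 master copy, before any optimizer state or activations):

```python
# Rough parameter-memory estimate for BF16 AMP (approximate figures).
n_params = 8e9                 # ~8B parameters for LLaMA 3 8B (rounded)
bytes_bf16_weights = 2         # BF16 copy used for forward/backward
bytes_fp32_master = 4          # FP32 master weights kept alongside it
total_bytes = n_params * (bytes_bf16_weights + bytes_fp32_master)
print(f"{total_bytes / 2**30:.1f} GiB")   # ~44.7 GiB, excluding optimizer state and activations
```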

Just wanted to point this out and ask why it was done this way. The paper also mentions a 58 GiB minimum -- but I think you would need much more than that.

If this is a deliberate decision, please point me to the studies showing that this kind of training is stable.

src: https://docs.fast.ai/callback.fp16.html
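For context, this is roughly the standard torch.amp pattern I would expect in place of the plain BF16 cast (a minimal sketch with a placeholder model and optimizer, not the actual torch_run.py code; note that GradScaler is strictly needed only for FP16, while BF16 autocast is usually run without it):

```python
import torch

# Minimal torch.amp training-loop sketch. The parameters stay in FP32 and only the
# forward/backward ops run in reduced precision inside the autocast context.
model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.cuda.amp.GradScaler()              # gradient scaling, required for float16

for _ in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()             # dummy loss for illustration
    scaler.scale(loss).backward()                 # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                        # unscale grads, then optimizer step
    scaler.update()                               # adjust the scale factor
```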


kyleliang919 commented May 9, 2024

Hi @bhavnicksm, my latest finding is that it might not be a problem with GaLore itself... Adam8bit is unstable on its own, and GaLore just makes it even more unstable, so the collapse shows up earlier in pretraining. If you train with Adam8bit (full rank) for long enough, it will collapse at some point. Overall, my current feeling is that, despite what was claimed, this method combined with Adam8bit is only stable for finetuning.
