OOM for training llama #1900
Comments
Thanks for the feedback. It does work on 4 x L4s, which have 24 GB each. I can see that the usage is around 22-24 GB. Other than trying a smaller batch size or block size, or perhaps a different multi-GPU strategy, I am not sure how this can be improved.
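For concreteness, a minimal sketch of those two knobs, assuming a plain PyTorch DataLoader feeds the finetuning loop; the dataset class and the values here are illustrative only, not from this thread:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyTokenDataset(Dataset):
    """Stand-in for a tokenized finetuning dataset (illustrative only)."""

    def __init__(self, n_samples=128, max_seq_length=512):
        # A shorter "block size" (sequence length) shrinks activation memory roughly linearly.
        self.data = torch.randint(0, 32_000, (n_samples, max_seq_length))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


micro_batch_size = 1  # per-GPU batch size; recover the effective batch size with
                      # gradient accumulation if convergence suffers

train_loader = DataLoader(
    DummyTokenDataset(max_seq_length=512),
    batch_size=micro_batch_size,
    shuffle=True,
)
```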
@rasbt thanks for the quick reply. So is it taking 22 GB in total across the GPUs, or on each GPU? I would think a sequential load strategy could help split the model across the GPUs, and 64 GB should be enough for it, but when using …
It was on each GPU. I think it uses substantially less than 22 GB x 4 in total, though; it might be that it works just fine on a single GPU with 40 GB, but I haven't tried. You could also consider an FSDP strategy with …
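Since the specific FSDP settings being suggested are not spelled out above, here is a hedged sketch of one way such a strategy could be configured with PyTorch Lightning; `cpu_offload`, `state_dict_type`, and the precision choice are illustrative assumptions:

```python
import lightning as L
from lightning.pytorch.strategies import FSDPStrategy

# Shard parameters/gradients/optimizer state across the 4 GPUs and park the
# shards in CPU RAM between uses (trades speed for memory).
strategy = FSDPStrategy(
    cpu_offload=True,
    state_dict_type="sharded",  # avoid gathering a full state dict on one rank
)

trainer = L.Trainer(
    accelerator="cuda",
    devices=4,
    strategy=strategy,
    precision="16-mixed",  # V100s have no bf16 support, so fp16 mixed precision here
)
# trainer.fit(lit_module, train_loader)  # lit_module: your LightningModule around the LLM
```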
Interestingly, using the CLI tool, I'm even able to finetune Llama 3.1 8B with no quantization across the 4 GPUs, although I suspect that's thanks to LoRA; I'll need to check whether it works with the Python API as well.
Ah yes, …
I'm trying to use the llama-3.2-1B model with the Python API on a compute instance with 4 Tesla V100s (4 x 16 GB), but the process keeps failing due to OOM. Watching `nvidia-smi`, I see the utilization shoot up to 16 GB on each GPU and then the process dies. From my understanding, the 1B model should work with much less VRAM, so maybe I'm doing something incorrect. Here is my code:

The process dies before even the first training pass. I also tried a few approaches with quantization, by defining `quantize` (and other params) in `self.llm.distribute` in the `setup` method, but none of these approaches seem to work. Any ideas on what I might be doing wrong? Thanks.
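For reference, a rough sketch of the kind of setup described above; this is not the reporter's actual code. `LLM.load` and `llm.distribute` follow LitGPT's Python API, while the LightningModule wrapper, the `distribute=None` argument, and the quantization/precision values are assumptions for illustration:

```python
import lightning as L
from litgpt import LLM


class FinetuneLLM(L.LightningModule):
    """Hypothetical wrapper; training_step / configure_optimizers omitted for brevity."""

    def __init__(self):
        super().__init__()
        # Assumption: distribute=None defers GPU placement so it can be done in setup().
        self.llm = LLM.load("meta-llama/Llama-3.2-1B", distribute=None)

    def setup(self, stage):
        # Illustrative values: quantize="bnb.nf4" requires bitsandbytes, and
        # "16-true" is used because V100s have no bf16 support.
        self.llm.distribute(precision="16-true", quantize="bnb.nf4")
```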