fully_shard() for huggingface model: pytorch caches too much GPU memory #1126
Comments
@mingdianliu could it be that the activations dominate the memory usage under such a setting? For a 7B model, even in float32, the parameters + gradients + optimizer states add up to roughly 112 GB, and with 16 GPUs each GPU gets roughly 7 GB. If you freeze some modules for fine-tuning, this number is even lower. The same applies to the 72B model: you will have to apply other techniques to reduce the memory consumed by activations, such as TP or activation checkpointing.
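To make the arithmetic above explicit, here is a rough sketch of the estimate, assuming an Adam-style optimizer with two float32 state tensors per parameter (activations are not included):

```python
# Back-of-the-envelope per-GPU memory for the sharded states only,
# following the numbers quoted in the comment above.
params = 7e9                              # 7B parameters
bytes_per_elem = 4                        # float32
param_mem = params * bytes_per_elem       # ~28 GB parameters
grad_mem = params * bytes_per_elem        # ~28 GB gradients
optim_mem = 2 * params * bytes_per_elem   # ~56 GB Adam exp_avg + exp_avg_sq
total = param_mem + grad_mem + optim_mem  # ~112 GB
per_gpu = total / 16                      # ~7 GB per GPU with 16-way sharding
print(f"total ~ {total / 1e9:.0f} GB, per GPU ~ {per_gpu / 1e9:.0f} GB")
```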
Hi @fegin, thanks for your follow-up. I found it is due to the PyTorch cache. The allocated and reserved GPU memory is quite small, while the cached GPU memory is even higher than 50 GB. I tried calling `torch.cuda.empty_cache()` after each training iteration, but the GPU memory cached during each training iteration is still high (~20 GB). I wonder if this is a bug in FSDP2. If not, is there any method that can mitigate this issue?
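A minimal sketch of the kind of per-iteration inspection described above, using the standard allocator counters (`empty_cache()` only releases unused cached blocks back to the driver):

```python
import torch

def log_cuda_memory(step: int) -> None:
    # memory_allocated: bytes currently held by live tensors.
    # memory_reserved: bytes held by PyTorch's caching allocator (the "cache").
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"step {step}: allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

# After an optimizer step, unused cached blocks can be released (as tried above):
# torch.cuda.empty_cache()
```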
Caching is not an issue, because that memory will be reused for other tensor allocations. It will not cause OOM: when new tensors are created, PyTorch first looks for free space in the cached memory, and only if there is no available cached space does it ask CUDA for more. Only if CUDA cannot provide enough memory will an OOM happen. So if you are not seeing OOM but only high cache memory, that should not be a problem. If you actually are seeing OOM, you can try to export this environment variable
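The variable is not named in the thread as captured here; a commonly suggested allocator setting for fragmentation-related OOMs is `PYTORCH_CUDA_ALLOC_CONF`, shown below as an assumption rather than as the commenter's exact recommendation:

```python
# Assumption: the thread does not spell out which variable was meant.
# PYTORCH_CUDA_ALLOC_CONF configures the CUDA caching allocator; the
# expandable_segments option often helps with fragmentation-related OOMs.
# It must be set before the CUDA context is created (or exported in the
# shell before launching torchrun).
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # import (and any CUDA use) only after the variable is set
```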
Dear Community,
I'm working on fine-tuning the Qwen2-VL model using `fully_shard()` and wrote a script for it. However, I noticed that GPU memory usage stays high (around 50 GB to 60 GB) even as I scale up the number of GPUs. Besides, it runs into OOM when I try to fine-tune the 72B model with 128 GPUs. I'm wondering if there might be any issue with my code or configuration. I'd really appreciate any insights or suggestions you might have. Thanks in advance!
My code:
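The script itself is not reproduced in this capture. For reference, below is a minimal sketch of what a `fully_shard()` (FSDP2) fine-tuning setup for a Hugging Face model typically looks like; the model name, layer access path, optimizer settings, and `get_batches()` loader are illustrative assumptions, not the author's actual code.

```python
# Illustrative sketch only (not the author's qwenvl_train_fsdp.py).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # PyTorch >= 2.6; older: torch.distributed._composable.fsdp
from transformers import AutoModelForCausalLM


def get_batches(num_steps: int = 100):
    # Placeholder data loader: yields toy token batches on the local GPU.
    for _ in range(num_steps):
        ids = torch.randint(0, 1000, (2, 512), device="cuda")
        yield {"input_ids": ids, "labels": ids}


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # A plain causal LM stands in for Qwen2-VL here to keep the sketch short.
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2-7B-Instruct", torch_dtype=torch.bfloat16
    )

    # Shard each transformer block first, then the root module, so parameters
    # are all-gathered one block at a time during forward/backward instead of
    # all at once; the sharded parameters end up as DTensors on the GPU mesh.
    for block in model.model.layers:
        fully_shard(block)
    fully_shard(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for step, batch in enumerate(get_batches()):
        out = model(**batch)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```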
Running commands:
torchrun --nnodes=2 --nproc_per_node=8 qwenvl_train_fsdp.py
torchrun --nnodes=4 --nproc_per_node=8 qwenvl_train_fsdp.py
torchrun --nnodes=8 --nproc_per_node=8 qwenvl_train_fsdp.py
The following are screenshots of `nvidia-smi` output (not reproduced here) for 16 GPUs, 32 GPUs, and 64 GPUs.