Description
Launched the finetuning job as follows, and it failed with a CUDA OOM error for Llama-2-70B:
ray_job_log_job_eqeqt513ex4xy1sgwgcjk8ag1i.log
$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml
Error
result = forward_call(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 268, in forward
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 4; 21.99 GiB total capacity; 16.79 GiB already allocated; 907.38 MiB free; 20.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
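The allocator message itself hints at one possible mitigation: since reserved memory (20.71 GiB) is well above allocated memory (16.79 GiB), capping `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` may reduce fragmentation. A minimal sketch of relaunching with that setting (the value 128 is an assumed starting point, not a verified fix for this workload):

```shell
# Cap the allocator's split size to reduce fragmentation
# (128 MiB is an assumption; tune per workload).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then relaunch the same job, e.g.:
# python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

If fragmentation is not the root cause, the remaining options are the usual ones: smaller micro-batch size, shorter sequence length, or more aggressive sharding/offload in the training config.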