Finetuning LLM workspace template failed with OOM for LoRA/Llama70B #174

@sudhirn-anyscale

Description

Launched the fine-tuning job as follows, and it failed with a CUDA out-of-memory error for Llama-2-70B (full log attached):
ray_job_log_job_eqeqt513ex4xy1sgwgcjk8ag1i.log

$ python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml

Error

   result = forward_call(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 268, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.75 GiB (GPU 4; 21.99 GiB total capacity; 16.79 GiB already allocated; 907.38 MiB free; 20.71 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
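
Since reserved memory (20.71 GiB) is well above allocated memory (16.79 GiB), the allocator's own suggestion of setting max_split_size_mb to reduce fragmentation may be worth trying. A minimal sketch of the relaunch, assuming the same entry point; the value 128 is an illustrative guess, not a tested setting, and with Ray the variable may need to be passed through the job's runtime environment so it actually reaches the worker processes:

# Hypothetical mitigation: cap the CUDA caching allocator's split size, per the error message
$ PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 python main.py job_compute_configs/aws.yaml training_configs/lora/llama-2-70b-4k-4xg5_48xlarge.yaml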
