Out of memory using the default training configuration #28

@JacobYuan7

Description

@JacobYuan7

Hi, many thanks for your great work.

I am training with the default script and find that it runs out of memory even with batch_size=1. I am wondering what might cause the problem; I'd appreciate any suggestions.


[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/lumina_mgpt/finetune_solver.py", line 114, in <module>
[rank0]:     solver.run()
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 518, in run
[rank0]:     train_stats = self.train_one_epoch(
[rank0]:   File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 620, in train_one_epoch
[rank0]:     self.optimizer.step()
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]:     ret = func(self, *args, **kwargs)
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]:     has_complex = self._init_group(
[rank0]:   File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]:     state["exp_avg_sq"] = torch.zeros_like(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 
exp name: 7B-8
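The traceback shows the OOM happening when AdamW allocates its exp_avg_sq buffer in _init_group, i.e. while creating optimizer state, not while processing a batch, which is why lowering batch_size does not help. A rough sketch of the arithmetic is below; the 7B parameter count and fp32 optimizer state are assumptions for illustration, not values read from the Lumina-mGPT config or solver.

```python
# Back-of-the-envelope memory estimate for full-parameter AdamW fine-tuning.
# Assumption: ~7B parameters and fp32 (4-byte) gradients and optimizer state.

def adamw_memory_gib(num_params: float, bytes_per_elem: int = 4) -> dict:
    """Estimate per-GPU memory (GiB) for weights, gradients, and AdamW state."""
    gib = 1024 ** 3
    weights = num_params * bytes_per_elem / gib
    grads = num_params * bytes_per_elem / gib
    # AdamW keeps two buffers per parameter: exp_avg and exp_avg_sq.
    optim_state = 2 * num_params * bytes_per_elem / gib
    return {
        "weights": weights,
        "grads": grads,
        "adamw_state": optim_state,
        "total": weights + grads + optim_state,
    }

if __name__ == "__main__":
    for name, gib in adamw_memory_gib(7e9).items():
        print(f"{name:>12}: {gib:6.1f} GiB")
    # Roughly 26 GiB each for weights and grads plus ~52 GiB of AdamW state,
    # i.e. ~104 GiB before activations, so it cannot fit on one GPU unless
    # the state is sharded across ranks (e.g. FSDP) or offloaded.
```

This estimate is independent of batch size, which matches the observed behaviour; whether the default config is expected to shard or offload this state is a question for the maintainers.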
