Hi, many thanks for your great work.
I am trying to run the default training script, but even with batch_size=1 training runs out of memory during the optimizer step. I am wondering what might be causing this, and I'd appreciate any suggestions.
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/workspace/xxxx/Lumina-mGPT-main/lumina_mgpt/finetune_solver.py", line 114, in <module>
[rank0]: solver.run()
[rank0]: File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 518, in run
[rank0]: train_stats = self.train_one_epoch(
[rank0]: File "/mnt/workspace/xxxx/Lumina-mGPT-main/xllmx/solvers/finetune/finetune.py", line 620, in train_one_epoch
[rank0]: self.optimizer.step()
[rank0]: File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]: out = func(*args, **kwargs)
[rank0]: File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]: ret = func(self, *args, **kwargs)
[rank0]: File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 177, in step
[rank0]: has_complex = self._init_group(
[rank0]: File "/mnt/workspace/xxxx/conda-envs/lumina-mgpt-5/lib/python3.10/site-packages/torch/optim/adamw.py", line 128, in _init_group
[rank0]: state["exp_avg_sq"] = torch.zeros_like(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU
exp name: 7B-8
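For context, here is my rough back-of-the-envelope estimate of the AdamW optimizer-state memory, which may explain why reducing the batch size does not help: the OOM happens while AdamW allocates its exp_avg_sq buffers, and those scale with the parameter count, not the batch size. The numbers below (7B parameters, fp32 optimizer states, bf16 weights/grads) are my assumptions, not measured from the repo.

```python
# Rough estimate of per-GPU memory that is independent of batch size,
# assuming AdamW keeps two fp32 state tensors (exp_avg, exp_avg_sq)
# per parameter and the model has ~7B parameters.
n_params = 7e9            # assumed parameter count for the 7B model
bytes_fp32 = 4
bytes_bf16 = 2
adamw_states = 2          # exp_avg + exp_avg_sq

optimizer_state_gib = n_params * bytes_fp32 * adamw_states / 2**30
weights_gib = n_params * bytes_bf16 / 2**30
grads_gib = n_params * bytes_bf16 / 2**30

print(f"AdamW states: {optimizer_state_gib:.1f} GiB")  # ~52 GiB
print(f"bf16 weights: {weights_gib:.1f} GiB")           # ~13 GiB
print(f"bf16 grads:   {grads_gib:.1f} GiB")             # ~13 GiB
```

If this estimate is roughly right, the optimizer states alone exceed a single GPU's memory unless they are sharded or offloaded, so batch_size=1 would still OOM at optimizer.step().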