
65B on multiple GPUs : CUDA out of memory with 4 x GPU RTX A5000 (24GB) / 96GB in total #18

Open
scampion opened this issue Mar 14, 2023 · 3 comments

Comments

@scampion

For the moment, I can't run the 65B model on 4 GPUs (96 GB of VRAM in total).

I'm investigating; the warning "bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable" is a first lead ...
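For context, a back-of-the-envelope estimate of the parameter memory alone (my own numbers, not from the repo) already suggests the weights have to be sharded:

```python
# Rough parameter-memory estimate for LLaMA-65B (illustration only; ignores
# activations, the KV cache, and quantization overhead).
params = 65e9
print(f"fp16: {params * 2 / 2**30:.0f} GiB")  # ~121 GiB -> exceeds 4 x 24 GiB even if split
print(f"int8: {params * 1 / 2**30:.0f} GiB")  # ~61 GiB  -> fits 96 GiB only if split across GPUs
```

So even with int8 weights, no single 24 GiB card can hold the model; it only fits if the layers are spread over the four GPUs.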

[1] % torchrun --nproc_per_node 4 example.py --ckpt_dir ../../LLaMA/30B --tokenizer_path ../../LLaMA/tokenizer.model
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/home/scampion/Code/llama/venv/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Allocating transformer on host
Allocating transformer on host
Allocating transformer on host
Allocating transformer on host
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 132, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.08 GiB already allocated; 6.94 MiB free; 5.08 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/scampion/Code/llama-int8/example.py", line 129, in <module>
    fire.Fire(main)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/scampion/Code/llama-int8/example.py", line 101, in main
    generator = load(ckpt_dir, tokenizer_path, max_seq_len, max_batch_size, use_int8)
  File "/home/scampion/Code/llama-int8/example.py", line 38, in load
    model = Transformer(model_args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 255, in __init__
    self.layers.append(TransformerBlock(layer_id, params))
  File "/home/scampion/Code/llama-int8/llama/model.py", line 206, in __init__
    self.attention = Attention(args)
  File "/home/scampion/Code/llama-int8/llama/model.py", line 129, in __init__
    ).cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 208.00 MiB (GPU 0; 23.68 GiB total capacity; 5.28 GiB already allocated; 6.94 MiB free; 5.28 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 887816) of binary: /home/scampion/Code/llama/venv/bin/python
Traceback (most recent call last):
  File "/home/scampion/Code/llama/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/scampion/Code/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 887817)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 887818)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 887819)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-14_09:55:43
  host      : vector
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 887816)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(venv)
@scampion
Author

After recompiling bitsandbytes from source against CUDA 11.7 (the version supported by torch), the issue is still there.

@scampion
Author

scampion commented Mar 14, 2023

My mistake, example.py doesn't support multiple GPUs.
WIP
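For anyone hitting the same wall: torchrun starts one process per GPU, and as the tracebacks above show, every rank ends up allocating on GPU 0. A minimal sketch (my assumption about a partial mitigation, not code from this repo) is to pin each worker to its own device via the LOCAL_RANK variable that torchrun sets; this only spreads the ranks out, it does not shard the 65B weights:

```python
# Minimal sketch (assumption, not the repo's code): pin each torchrun worker to its
# own GPU so the ranks don't all allocate on cuda:0. This does not shard the model.
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by torchrun for each worker
torch.cuda.set_device(local_rank)                    # later .cuda() calls use this device
```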

@entn-at

entn-at commented Mar 14, 2023

It's complicated. This fork got rid of many of the pieces required for multi-GPU usage. One way to restore that would be to create adapted versions of the model-parallel layers in fairscale (https://github.com/facebookresearch/fairscale/blob/main/fairscale/nn/model_parallel/layers.py) that use bitsandbytes.

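A rough sketch of that idea, under my own assumptions (bitsandbytes built with CUDA support, torch.distributed already initialized by torchrun, output dimension divisible by the world size); it mirrors fairscale's ColumnParallelLinear rather than reproducing it:

```python
# Sketch only: each rank holds one column shard of the weight as an int8 layer;
# the full output is re-assembled with all_gather. Inference-only (no gradients).
import torch
import torch.distributed as dist
import bitsandbytes as bnb

class Int8ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0, "output dim must divide evenly across ranks"
        # One int8 shard per rank, covering out_features // world_size output columns.
        self.shard = bnb.nn.Linear8bitLt(
            in_features, out_features // world_size,
            bias=False, has_fp16_weights=False, threshold=6.0,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_out = self.shard(x)
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)   # collect every rank's output columns
        return torch.cat(gathered, dim=-1)     # re-assemble the full output dimension
```

The checkpoint would still have to be loaded shard-by-shard onto each rank, and the row-parallel counterpart handled similarly, so this is only the skeleton of the approach entn-at describes.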