Step 1 (warmup training): out of memory when training on multiple GPUs

I want to train on multiple GPUs. Besides setting --nproc_per_node 4 --nnodes 1 in the torchrun command stored in the header variable (export header="torchrun ...") and setting export CUDA_VISIBLE_DEVICES=4,5,6,7, is there anything else I need to set up? Right now the run fails with a CUDA out-of-memory error on all four GPUs, each of which has 24 GB of memory. The model being trained is Llama2-7B-HF.

trainable params: 134,217,728 || all params: 6,872,641,536 || trainable%: 1.9529278123549145
[train set] examples: 13533; # avg tokens: 370.9773254394531
[train set] examples: 13533; # avg completion tokens: 105.39820861816406
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in <module>
    main()
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
    trainer = Trainer(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 3 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in <module>
    main()
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
    trainer = Trainer(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in <module>
    main()
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
    trainer = Trainer(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 2 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 183, in <module>
    main()
  File "/mnt/users/ylu/XWB/LESS/less/train/train.py", line 152, in main
    trainer = Trainer(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 456, in __init__
    self._move_model_to_device(model, args.device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/transformers/trainer.py", line 690, in _move_model_to_device
    model = model.to(device)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 1 has a total capacty of 23.67 GiB of which 111.62 MiB is free. Including non-PyTorch memory, this process has 23.56 GiB memory in use. Of the allocated memory 23.36 GiB is allocated by PyTorch, and 1.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-11-03 07:03:40,851] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 991055) of binary: /mnt/users/ylu/anaconda3/envs/xwb_less/bin/python
Traceback (most recent call last):
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/users/ylu/anaconda3/envs/xwb_less/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
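
A rough check of the numbers in the log above. Every rank fails inside `model.to(device)` in `Trainer.__init__`, which suggests that each of the four processes is placing a complete copy of the model on its own GPU rather than a shard of it. The sketch below estimates the size of the weights alone, using the "all params" count and the GPU capacity printed in the log. The dtypes are assumptions, since the log does not say which precision the weights are loaded in, and the helper name `weights_gib` is made up for illustration.

```python
# Estimate the memory needed just to hold one full copy of the weights.
# ALL_PARAMS and GPU_CAPACITY_GIB come from the log; the dtypes are assumptions.

GIB = 1024 ** 3

ALL_PARAMS = 6_872_641_536   # "all params" reported before the traceback
GPU_CAPACITY_GIB = 23.67     # total capacity reported in the OOM message

# Bytes per parameter for two common weight dtypes (assumed, not read from the log).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_gib(num_params: int, dtype: str) -> float:
    """Return the size in GiB of num_params weights stored in the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / GIB


for dtype in BYTES_PER_PARAM:
    need = weights_gib(ALL_PARAMS, dtype)
    verdict = "fits" if need < GPU_CAPACITY_GIB else "does not fit"
    print(f"{dtype}: {need:.2f} GiB per process, {verdict} in {GPU_CAPACITY_GIB} GiB")
```

Under these assumptions a float32 copy needs about 25.60 GiB, which is more than a single 23.67 GiB card holds, while a bfloat16 copy needs about 12.80 GiB. That would be consistent with the failure happening while the model is still being moved to the device, before any training step runs. This estimate ignores gradients, optimizer state, and activations, so it is only a lower bound on what training needs.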

