
"Signal 11 received" when finetuning the SPHINX-tiny-1k model with two RTX3090 #211

@hoho2312

Description

Hi, I am trying to finetune the SPHINX-Tiny-1k model on my own vision-language QA dataset using two RTX 3090s (I think 48 GB of combined VRAM should be enough for this), with the following script:

pretrained_path=checkpoint/SPHINX-Tiny-1k
pretrained_type=consolidated
llama_config="checkpoint/SPHINX-Tiny-1k/config.json"
tokenizer_path="checkpoint/SPHINX-Tiny-1k/tokenizer.model"
data_config=configs/data/finetune/train.yaml
llama_type=llama_ens5_light

data_parallel=sdp
model_parallel=2

exp_name=llama_ens5_light_13b_esd
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=2 main_finetune.py \
--output_dir output/"$exp_name" --epochs 1 --warmup_epochs 0.01 \
--batch_size 16 --accum_iter 4 --num_workers 4 \
--max_words 3072 \
--lr 2e-5 --min_lr 0 --clip_grad 8 --weight_decay 0 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type $llama_type --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--dialog    \
--data_config $data_config \
--image_transform padded_resize  \
2>&1 | tee -a output/"$exp_name"/output.log

echo "exp name: $exp_name"

The run starts fine: the dataset and the model both load successfully. However, after about 240 iterations of the first epoch, training aborts with an uninterpretable signal 11, and I cannot locate the actual error from the log below (a diagnostic rerun I am considering is sketched after the log):

[16:59:27.125714] <accessory.data.conversation.dataset.FinetuneDialogDataset object at 0x7fb543e1c5e0>
[16:59:27.128154] Start training for 1 epochs
[16:59:27.136068] log_dir: output/llama_ens5_light_13b_esd
/home/admin3090/Chat/accessory/engine_finetune.py:41: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  "bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
/home/admin3090/Chat/accessory/engine_finetune.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  "fp16": torch.cuda.amp.autocast(dtype=torch.float16),
/home/admin3090/Chat/accessory/engine_finetune.py:41: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  "bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
/home/admin3090/Chat/accessory/engine_finetune.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  "fp16": torch.cuda.amp.autocast(dtype=torch.float16),
[16:59:47.230226] Epoch: [0]  [0/1032]  lr: 0.00000000  closs: 0.2291 (0.2291)  time: 20.0930  data: 1.2812  max mem: 13506
[17:01:14.500477] Epoch: [0]  [10/1032]  lr: 0.00001550  closs: 0.2314 (0.2138)  grad_norm: 65.6548 (75.2946)  time: 9.7602  data: 0.1167  max mem: 18924
[17:02:46.310460] Epoch: [0]  [20/1032]  lr: 0.00002000  closs: 0.2594 (0.9128)  grad_norm: 74.9498 (70.0313)  time: 9.4843  data: 0.0612  max mem: 18925
[17:04:22.817312] Epoch: [0]  [30/1032]  lr: 0.00001999  closs: 0.2692 (0.9525)  grad_norm: 74.9498 (62.4132)  time: 9.5379  data: 0.0415  max mem: 18927
[17:06:00.424761] Epoch: [0]  [40/1032]  lr: 0.00001996  closs: 0.2651 (0.7938)  grad_norm: 65.6548 (53.8146)  time: 9.5923  data: 0.0314  max mem: 18927
[17:07:43.731238] Epoch: [0]  [50/1032]  lr: 0.00001993  closs: 0.2603 (0.6898)  grad_norm: 40.6184 (47.8622)  time: 9.7370  data: 0.0253  max mem: 18927
[17:09:29.007985] Epoch: [0]  [60/1032]  lr: 0.00001988  closs: 0.2594 (0.6501)  grad_norm: 33.3648 (42.0804)  time: 9.8666  data: 0.0212  max mem: 18927
[17:11:20.009014] Epoch: [0]  [70/1032]  lr: 0.00001984  closs: 0.2491 (0.6388)  grad_norm: 24.1987 (38.8844)  time: 10.0403  data: 0.0182  max mem: 18927
[17:13:12.229725] Epoch: [0]  [80/1032]  lr: 0.00001977  closs: 0.2486 (0.6420)  grad_norm: 20.2569 (40.3147)  time: 10.1862  data: 0.0160  max mem: 18927
[17:15:02.561021] Epoch: [0]  [90/1032]  lr: 0.00001972  closs: 0.2486 (0.5982)  grad_norm: 18.5175 (36.9511)  time: 10.2792  data: 0.0143  max mem: 18927
[17:16:52.212519] Epoch: [0]  [100/1032]  lr: 0.00001962  closs: 0.2692 (0.6277)  grad_norm: 18.3958 (33.8891)  time: 10.3471  data: 0.0129  max mem: 18927
[17:18:43.665149] Epoch: [0]  [110/1032]  lr: 0.00001955  closs: 0.2692 (0.6057)  grad_norm: 18.3958 (32.8741)  time: 10.4190  data: 0.0117  max mem: 18928
[17:20:37.711438] Epoch: [0]  [120/1032]  lr: 0.00001944  closs: 0.2651 (0.5788)  grad_norm: 15.0242 (29.9831)  time: 10.5005  data: 0.0108  max mem: 18928
[17:22:30.155681] Epoch: [0]  [130/1032]  lr: 0.00001935  closs: 0.2473 (0.5367)  grad_norm: 8.6509 (28.2386)  time: 10.5573  data: 0.0100  max mem: 18928
[17:24:23.366792] Epoch: [0]  [140/1032]  lr: 0.00001922  closs: 0.2473 (0.5356)  grad_norm: 8.6509 (26.6344)  time: 10.6114  data: 0.0093  max mem: 18928
[17:26:13.671742] Epoch: [0]  [150/1032]  lr: 0.00001912  closs: 0.2403 (0.5152)  grad_norm: 8.4717 (25.2288)  time: 10.6392  data: 0.0087  max mem: 18928
[17:28:06.069605] Epoch: [0]  [160/1032]  lr: 0.00001896  closs: 0.2403 (0.5065)  grad_norm: 8.4717 (24.1122)  time: 10.6765  data: 0.0081  max mem: 18928
[17:29:53.421809] Epoch: [0]  [170/1032]  lr: 0.00001885  closs: 0.2473 (0.5188)  grad_norm: 8.4717 (23.5638)  time: 10.6799  data: 0.0077  max mem: 18928
[17:31:39.230066] Epoch: [0]  [180/1032]  lr: 0.00001867  closs: 0.2486 (0.5062)  grad_norm: 8.4815 (23.2716)  time: 10.6744  data: 0.0073  max mem: 18928
[17:33:24.598415] Epoch: [0]  [190/1032]  lr: 0.00001854  closs: 0.2491 (0.4992)  grad_norm: 8.4815 (22.6958)  time: 10.6672  data: 0.0069  max mem: 18928
[17:35:18.886923] Epoch: [0]  [200/1032]  lr: 0.00001835  closs: 0.2603 (0.5207)  grad_norm: 8.4815 (22.7723)  time: 10.7051  data: 0.0065  max mem: 18928
[17:37:14.505860] Epoch: [0]  [210/1032]  lr: 0.00001821  closs: 0.2692 (0.5399)  grad_norm: 8.6509 (22.6800)  time: 10.7457  data: 0.0062  max mem: 18928
[17:39:06.081179] Epoch: [0]  [220/1032]  lr: 0.00001799  closs: 0.3310 (0.5411)  grad_norm: 14.3802 (22.3858)  time: 10.7643  data: 0.0060  max mem: 18928
[17:40:56.543461] Epoch: [0]  [230/1032]  lr: 0.00001784  closs: 0.3835 (0.5560)  grad_norm: 14.3802 (24.1329)  time: 10.7765  data: 0.0057  max mem: 18928
[17:42:47.956180] Epoch: [0]  [240/1032]  lr: 0.00001761  closs: 0.3902 (0.5479)  grad_norm: 13.0119 (24.1339)  time: 10.7916  data: 0.0055  max mem: 18928
W0507 17:44:30.794000 21122 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 21152 closing signal SIGTERM
E0507 17:44:31.110000 21122 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 1 (pid: 21153) of binary: /home/admin3090/anaconda3/envs/chat_2/bin/python3.10
Traceback (most recent call last):
  File "/home/admin3090/anaconda3/envs/chat_2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
main_finetune.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-07_17:44:30
  host      : admin3090
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 21153)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 21153
=======================================================
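
To get something more informative than just "Signal 11" on the next run, I am thinking of exporting the following before launching the same script. This is only a sketch of diagnostics I have not tried yet; all three are standard CPython / NCCL / CUDA environment variables rather than anything specific to this repo:

export PYTHONFAULTHANDLER=1    # standard CPython switch: print a Python traceback if a process receives SIGSEGV
export NCCL_DEBUG=INFO         # standard NCCL setting, in case the crash happens inside a collective
export CUDA_LAUNCH_BLOCKING=1  # standard CUDA setting: run kernels synchronously so errors surface at the failing op (slower)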

My current training environment is:
Ubuntu 20.04
Python 3.10
CUDA 12.1
Torch 2.7.0
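
One isolation step I am also considering (again, only a sketch, not yet run) is to disable the DataLoader worker processes, since a segfault inside a worker process would also take its rank down with signal 11. The command below is identical to the script above except that --num_workers is set to 0 and the log file name is changed:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=2 main_finetune.py \
--output_dir output/"$exp_name" --epochs 1 --warmup_epochs 0.01 \
--batch_size 16 --accum_iter 4 --num_workers 0 \
--max_words 3072 \
--lr 2e-5 --min_lr 0 --clip_grad 8 --weight_decay 0 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type $llama_type --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--dialog \
--data_config $data_config \
--image_transform padded_resize \
2>&1 | tee -a output/"$exp_name"/output_num_workers0.log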

I would really appreciate it if anyone could help with this weird error...
