Hi, I am trying to finetune the SPHINX-Tiny-1k model on my own vision-language QA dataset with two RTX 3090s (48GB of VRAM in total, which I think should be enough), using the following script:
pretrained_path=checkpoint/SPHINX-Tiny-1k
pretrained_type=consolidated
llama_config="checkpoint/SPHINX-Tiny-1k/config.json"
tokenizer_path="checkpoint/SPHINX-Tiny-1k/tokenizer.model"
data_config=configs/data/finetune/train.yaml
llama_type=llama_ens5_light
data_parallel=sdp
model_parallel=2
exp_name=llama_ens5_light_13b_esd
echo "exp name: $exp_name"
mkdir -p output/"$exp_name"
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=2 main_finetune.py \
--output_dir output/"$exp_name" --epochs 1 --warmup_epochs 0.01 \
--batch_size 16 --accum_iter 4 --num_workers 4 \
--max_words 3072 \
--lr 2e-5 --min_lr 0 --clip_grad 8 --weight_decay 0 \
--data_parallel "$data_parallel" --model_parallel_size "$model_parallel" --checkpointing \
--llama_type $llama_type --llama_config "$llama_config" --tokenizer_path "$tokenizer_path" \
--pretrained_path "$pretrained_path" --pretrained_type="$pretrained_type" \
--dialog \
--data_config $data_config \
--image_transform padded_resize \
2>&1 | tee -a output/"$exp_name"/output.log
echo "exp name: $exp_name"
The training works fine at first: the dataset and the model both load successfully. However, after several iterations of the first epoch, the run crashes with signal 11 (SIGSEGV), and I cannot locate the source of the error from the logs:
[16:59:27.125714] <accessory.data.conversation.dataset.FinetuneDialogDataset object at 0x7fb543e1c5e0>
[16:59:27.128154] Start training for 1 epochs
[16:59:27.136068] log_dir: output/llama_ens5_light_13b_esd
/home/admin3090/Chat/accessory/engine_finetune.py:41: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
"bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
/home/admin3090/Chat/accessory/engine_finetune.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
"fp16": torch.cuda.amp.autocast(dtype=torch.float16),
/home/admin3090/Chat/accessory/engine_finetune.py:41: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
"bf16": torch.cuda.amp.autocast(dtype=torch.bfloat16),
/home/admin3090/Chat/accessory/engine_finetune.py:42: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
"fp16": torch.cuda.amp.autocast(dtype=torch.float16),
[16:59:47.230226] Epoch: [0] [0/1032] lr: 0.00000000 closs: 0.2291 (0.2291) time: 20.0930 data: 1.2812 max mem: 13506
[17:01:14.500477] Epoch: [0] [10/1032] lr: 0.00001550 closs: 0.2314 (0.2138) grad_norm: 65.6548 (75.2946) time: 9.7602 data: 0.1167 max mem: 18924
[17:02:46.310460] Epoch: [0] [20/1032] lr: 0.00002000 closs: 0.2594 (0.9128) grad_norm: 74.9498 (70.0313) time: 9.4843 data: 0.0612 max mem: 18925
[17:04:22.817312] Epoch: [0] [30/1032] lr: 0.00001999 closs: 0.2692 (0.9525) grad_norm: 74.9498 (62.4132) time: 9.5379 data: 0.0415 max mem: 18927
[17:06:00.424761] Epoch: [0] [40/1032] lr: 0.00001996 closs: 0.2651 (0.7938) grad_norm: 65.6548 (53.8146) time: 9.5923 data: 0.0314 max mem: 18927
[17:07:43.731238] Epoch: [0] [50/1032] lr: 0.00001993 closs: 0.2603 (0.6898) grad_norm: 40.6184 (47.8622) time: 9.7370 data: 0.0253 max mem: 18927
[17:09:29.007985] Epoch: [0] [60/1032] lr: 0.00001988 closs: 0.2594 (0.6501) grad_norm: 33.3648 (42.0804) time: 9.8666 data: 0.0212 max mem: 18927
[17:11:20.009014] Epoch: [0] [70/1032] lr: 0.00001984 closs: 0.2491 (0.6388) grad_norm: 24.1987 (38.8844) time: 10.0403 data: 0.0182 max mem: 18927
[17:13:12.229725] Epoch: [0] [80/1032] lr: 0.00001977 closs: 0.2486 (0.6420) grad_norm: 20.2569 (40.3147) time: 10.1862 data: 0.0160 max mem: 18927
[17:15:02.561021] Epoch: [0] [90/1032] lr: 0.00001972 closs: 0.2486 (0.5982) grad_norm: 18.5175 (36.9511) time: 10.2792 data: 0.0143 max mem: 18927
[17:16:52.212519] Epoch: [0] [100/1032] lr: 0.00001962 closs: 0.2692 (0.6277) grad_norm: 18.3958 (33.8891) time: 10.3471 data: 0.0129 max mem: 18927
[17:18:43.665149] Epoch: [0] [110/1032] lr: 0.00001955 closs: 0.2692 (0.6057) grad_norm: 18.3958 (32.8741) time: 10.4190 data: 0.0117 max mem: 18928
[17:20:37.711438] Epoch: [0] [120/1032] lr: 0.00001944 closs: 0.2651 (0.5788) grad_norm: 15.0242 (29.9831) time: 10.5005 data: 0.0108 max mem: 18928
[17:22:30.155681] Epoch: [0] [130/1032] lr: 0.00001935 closs: 0.2473 (0.5367) grad_norm: 8.6509 (28.2386) time: 10.5573 data: 0.0100 max mem: 18928
[17:24:23.366792] Epoch: [0] [140/1032] lr: 0.00001922 closs: 0.2473 (0.5356) grad_norm: 8.6509 (26.6344) time: 10.6114 data: 0.0093 max mem: 18928
[17:26:13.671742] Epoch: [0] [150/1032] lr: 0.00001912 closs: 0.2403 (0.5152) grad_norm: 8.4717 (25.2288) time: 10.6392 data: 0.0087 max mem: 18928
[17:28:06.069605] Epoch: [0] [160/1032] lr: 0.00001896 closs: 0.2403 (0.5065) grad_norm: 8.4717 (24.1122) time: 10.6765 data: 0.0081 max mem: 18928
[17:29:53.421809] Epoch: [0] [170/1032] lr: 0.00001885 closs: 0.2473 (0.5188) grad_norm: 8.4717 (23.5638) time: 10.6799 data: 0.0077 max mem: 18928
[17:31:39.230066] Epoch: [0] [180/1032] lr: 0.00001867 closs: 0.2486 (0.5062) grad_norm: 8.4815 (23.2716) time: 10.6744 data: 0.0073 max mem: 18928
[17:33:24.598415] Epoch: [0] [190/1032] lr: 0.00001854 closs: 0.2491 (0.4992) grad_norm: 8.4815 (22.6958) time: 10.6672 data: 0.0069 max mem: 18928
[17:35:18.886923] Epoch: [0] [200/1032] lr: 0.00001835 closs: 0.2603 (0.5207) grad_norm: 8.4815 (22.7723) time: 10.7051 data: 0.0065 max mem: 18928
[17:37:14.505860] Epoch: [0] [210/1032] lr: 0.00001821 closs: 0.2692 (0.5399) grad_norm: 8.6509 (22.6800) time: 10.7457 data: 0.0062 max mem: 18928
[17:39:06.081179] Epoch: [0] [220/1032] lr: 0.00001799 closs: 0.3310 (0.5411) grad_norm: 14.3802 (22.3858) time: 10.7643 data: 0.0060 max mem: 18928
[17:40:56.543461] Epoch: [0] [230/1032] lr: 0.00001784 closs: 0.3835 (0.5560) grad_norm: 14.3802 (24.1329) time: 10.7765 data: 0.0057 max mem: 18928
[17:42:47.956180] Epoch: [0] [240/1032] lr: 0.00001761 closs: 0.3902 (0.5479) grad_norm: 13.0119 (24.1339) time: 10.7916 data: 0.0055 max mem: 18928
W0507 17:44:30.794000 21122 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 21152 closing signal SIGTERM
E0507 17:44:31.110000 21122 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 1 (pid: 21153) of binary: /home/admin3090/anaconda3/envs/chat_2/bin/python3.10
Traceback (most recent call last):
File "/home/admin3090/anaconda3/envs/chat_2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in main
run(args)
File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
elastic_launch(
File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/admin3090/anaconda3/envs/chat_2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
main_finetune.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-07_17:44:30
host : admin3090
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 21153)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 21153
=======================================================
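Since the traceback only shows the elastic agent and not the worker itself, my next idea is to enable faulthandler in main_finetune.py to get a Python-level traceback when the worker receives SIGSEGV. A minimal sketch (untested on my side):

import faulthandler
# Dump the Python traceback of all threads to stderr when the process
# receives SIGSEGV (also SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(all_threads=True)

Alternatively, setting PYTHONFAULTHANDLER=1 in the environment before torchrun should have the same effect without touching the code.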
My training environment:
Ubuntu 20.04
Python 3.10
CUDA 12.1
Torch 2.7.0
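One more observation: the FutureWarnings in the log come from the deprecated torch.cuda.amp.autocast API and are probably unrelated to the crash, but my guess at silencing them in engine_finetune.py (lines 41-42) would be:

# Replace the deprecated torch.cuda.amp.autocast(dtype=...) calls with
# the device-typed torch.amp.autocast API:
"bf16": torch.amp.autocast("cuda", dtype=torch.bfloat16),
"fp16": torch.amp.autocast("cuda", dtype=torch.float16),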
I would really appreciate it if anyone could help with this weird error...