Error when fine-tuning GPT-2
#7847
- It seems to be a problem with the dataset: the input sequences are too long.
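This hypothesis is easy to check: GPT-2's position-embedding table has only `n_positions = 1024` rows, so any tokenized sample longer than that produces a position index outside the table, which is exactly the kind of out-of-range embedding lookup that the `srcIndex < srcSelectDimSize` assertion in the traceback below reports. A minimal sketch (not the thread author's code) to count offending samples, assuming, purely for illustration, an alpaca-style JSON dataset with `instruction`/`output` fields:

```python
# Count samples whose tokenized length exceeds GPT-2's position limit.
# The dataset path and field names are assumptions for illustration.
import json

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
limit = config.n_positions  # 1024 for GPT-2

with open("data/my_dataset.json", encoding="utf-8") as f:  # hypothetical path
    samples = json.load(f)

too_long = sum(
    len(tokenizer(s["instruction"] + s["output"])["input_ids"]) > limit
    for s in samples
)
print(f"{too_long} of {len(samples)} samples exceed n_positions={limit}")
```

If the count is nonzero, setting `cutoff_len` to 1024 or less in the LLaMA-Factory training config should truncate inputs before they ever reach the model.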
- The details are below. Single-step debugging cannot pinpoint the error, and it fails on the very first step :-(
[rank5]: Traceback (most recent call last):
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/launcher.py", line 23, in
[rank5]: launch()
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/launcher.py", line 19, in launch
[rank5]: run_exp()
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank5]: _training_function(config={"args": args, "callbacks": callbacks})
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/train/tuner.py", line 69, in _training_function
[rank5]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/train/sft/workflow.py", line 130, in run_sft
[rank5]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank5]: return inner_training_loop(
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank5]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
[rank5]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank5]: File "/home/yangshijun/LLaMA-Factory-node1/src/llamafactory/train/sft/trainer.py", line 103, in compute_loss
[rank5]: return super().compute_loss(model, inputs, *args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/trainer.py", line 3783, in compute_loss
[rank5]: outputs = model(**inputs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]: return self._call_impl(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank5]: return forward_call(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank5]: else self._run_ddp_forward(*inputs, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank5]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]: return self._call_impl(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank5]: return forward_call(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank5]: return model_forward(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in call
[rank5]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank5]: return func(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1062, in forward
[rank5]: transformer_outputs = self.transformer(
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank5]: return self._call_impl(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank5]: return forward_call(*args, **kwargs)
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 829, in forward
[rank5]: attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 378, in _prepare_4d_causal_attention_mask_for_sdpa
[rank5]: ignore_causal_mask = AttentionMaskConverter._ignore_causal_mask_sdpa(
[rank5]: File "/anaconda3/envs/llama_factory_node1/lib/python3.10/site-packages/transformers/modeling_attn_mask_utils.py", line 288, in _ignore_causal_mask_sdpa
[rank5]: elif not is_tracing and torch.all(attention_mask == 1):
[rank5]: RuntimeError: CUDA error: device-side assert triggered
[rank5]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank5]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank5]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [18,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [18,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [18,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [18,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:1422: indexSelectLargeIndex: block: [18,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
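The assertion fires inside an `index_select`, i.e. an embedding lookup: some index is greater than or equal to the size of the embedding table. With GPT-2 this is typically either a token id >= `vocab_size` (for example, a newly added special token without a matching `resize_token_embeddings` call) or a position id >= `n_positions = 1024` from an over-long sequence. Because CUDA reports the failure asynchronously, the Python stack trace points at an unrelated line (`torch.all(attention_mask == 1)` here). Besides rerunning with `CUDA_LAUNCH_BLOCKING=1` as the message suggests, replaying one batch on CPU turns the assert into a readable `IndexError`. A hedged sketch, where the batch below is a stand-in for the first batch the trainer sees:

```python
# Debugging sketch: check the two invariants the CUDA assert guards, then
# run the batch on CPU, where a bad index fails loudly and synchronously.
# `batch` here is a stand-in; in practice, grab the first batch from the
# training dataloader.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer("some sample text", return_tensors="pt")

assert batch["input_ids"].max().item() < model.config.vocab_size, "token id out of range"
assert batch["input_ids"].shape[1] <= model.config.n_positions, "sequence longer than 1024"

with torch.no_grad():
    model(**batch)  # on CPU, an out-of-range index raises a plain IndexError
```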