Bug in calculating num_input_tokens_seen in multi-gpu environments #34503
Comments
Can you please give a full reproducer? Also try building off of main.

This worked, thank you.

+1. I think this was introduced by #34198. @muellerzr, you can repro via https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface using the latest transformers.
System Info
transformers version: 4.47.0.dev0

Who can help?
@muellerzr @ArthurZucker

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I'm using LLaMA-Factory to run Qwen2-VL multimodal SFT on an 8*A100 machine. Everything works fine with transformers 4.45.2, but after upgrading to 4.46.0 or later, training fails with the traceback below.
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _inner_training_loop
[rank0]: self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: a Tensor with 8 elements cannot be converted to Scalar
Looking at git blame, the failing code was introduced in #34198.
The original code is:

```python
self.state.num_input_tokens_seen += torch.sum(
    self.accelerator.gather(
        torch.tensor(
            inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64
        )
    )
)
```
The modified code is:

```python
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
```
I'm not an expert on this part of the code, but as far as I can tell (with some help from GPT), the two snippets behave the same only on a single GPU. In a multi-GPU environment, the first version sums the gathered per-device token counts before adding them to the state, while the second version calls .item() directly on the gathered tensor, which has one element per device and therefore raises the error above. The two are not equivalent in distributed training, so the first form (or an equivalent reduction) should be used.
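The failure is easy to see in isolation: accelerator.gather returns one element per process, and .item() only accepts a single-element tensor. A minimal sketch in plain PyTorch, with no distributed setup; the 8-element tensor and its made-up values just stand in for what gather would return on 8 GPUs:

```python
import torch

# Stand-in for accelerator.gather(input_tokens) on 8 ranks:
# one token count per process (values are illustrative only).
gathered = torch.tensor([512, 480, 512, 500, 512, 496, 512, 508], dtype=torch.int64)

# gathered.item()  # RuntimeError: a Tensor with 8 elements cannot be converted to Scalar

total = gathered.sum().item()  # reduce across ranks first, then convert to a Python int
print(total)
```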
I tried reverting to the original code locally and rebuilding transformers, and that resolved the problem. Would love to hear what you think!
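If the refactored style is preferred, one possible fix (my suggestion for the same trainer code shown above, not necessarily how it will be resolved upstream) is to keep the per-device tensor but reduce it before converting to a scalar:

```python
input_tokens = inputs[main_input_name].numel()
input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
# gather() returns one element per rank, so sum across processes before .item().
self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).sum().item()
```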
Expected behavior
num_input_tokens_seen should accumulate correctly across all devices in multi-GPU training. Please fix this bug and release a patched version as soon as possible.