
Bug in calculating num_input_tokens_seen in multi-gpu environments #34503

Closed
Tender-Su opened this issue Oct 30, 2024 · 4 comments · Fixed by #34554
Tender-Su commented Oct 30, 2024

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.4.0-171-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using LLaMA-Factory to run Qwen2-VL multimodal SFT tasks on an 8×A100 machine. Everything works fine with transformers 4.45.2, but after updating to 4.46.0 or later, training no longer works:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _inner_training_loop
[rank0]: self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: a Tensor with 8 elements cannot be converted to Scalar
Looking at git blame, the offending code was introduced in #34198.
The original code is:

    self.state.num_input_tokens_seen += torch.sum(
        self.accelerator.gather(
            torch.tensor(
                inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64
            )
        )
    )

The modified code is:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()

I'm not an expert on this (I asked GPT about it), but the two versions appear to behave the same in a single-GPU environment. In a multi-GPU environment, the first version correctly accumulates the input token counts across all devices, whereas the second one errors out (or would, at best, only count the tokens on the current device). They are therefore not equivalent in distributed training, and the first version should be used to ensure correctness.
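A minimal standalone sketch of the failure (no real distributed setup; the per-rank counts below are made-up stand-ins for what accelerator.gather(input_tokens) would return on 8 GPUs):

    import torch

    # Stand-in for the result of accelerator.gather(input_tokens) across 8 ranks:
    # an int64 tensor with one token count per process (values are hypothetical).
    gathered = torch.tensor([512, 498, 530, 512, 505, 520, 499, 511], dtype=torch.int64)

    # 4.46.0 code path: .item() on a multi-element tensor raises
    # "RuntimeError: a Tensor with 8 elements cannot be converted to Scalar".
    try:
        gathered.cpu().item()
    except RuntimeError as e:
        print(e)

    # Pre-4.46 code path: reduce across ranks first, then convert to a scalar.
    total = torch.sum(gathered).cpu().item()
    print(total)  # 4087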

I tried reverting the code locally and rebuilding transformers, and the problem seems to be solved. Would love to hear what you think!

Expected behavior

Fix this bug and release a new version as soon as possible.

@Tender-Su Tender-Su added the bug label Oct 30, 2024
@muellerzr
Contributor

Can you please give a full reproducer? Also try building off of pip install git+https://github.com/huggingface/transformers@muellerzr-final-gradaccum-check

@techkang
Contributor

self.accelerator.gather(input_tokens) returns a tensor with one element per process, which cannot be directly added to self.state.num_input_tokens_seen (an int) via .item(). A simple fix is to wrap the gathered tensor in torch.sum(...) before converting it to a scalar.
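In code terms, the suggested fix would look roughly like this (a sketch combining the snippets above; not necessarily the exact change that landed in #34554):

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    # Sum the per-process counts before converting to a Python scalar.
    self.state.num_input_tokens_seen += torch.sum(self.accelerator.gather(input_tokens)).cpu().item()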

@Tender-Su
Author

self.accelerator.gather(input_tokens) returns a tensor with one element per process, which cannot be directly added to self.state.num_input_tokens_seen (an int) via .item(). A simple fix is to wrap the gathered tensor in torch.sum(...) before converting it to a scalar.

This worked, thank you.

@ByronHsu
Contributor

+1. I think it was introduced by #34198. @muellerzr you can repro via https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface by using the latest transformers version instead of the currently hardcoded one.
