
Bug in calculating num_input_tokens_seen in multi-gpu environments #34503

Closed
Tender-Su opened this issue Oct 30, 2024 · 4 comments · Fixed by #34554
Tender-Su commented Oct 30, 2024

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.4.0-171-generic-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

@muellerzr @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm using LLaMA-Factory to run Qwen2-VL multimodal SFT tasks on an 8×A100 machine. Everything works fine with transformers 4.45.2, but after updating to 4.46.0 or later, training no longer works:
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in
[rank0]: launch()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank0]: run_exp()
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank0]: File "/mnt/zj-gpfs/home/sz/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 96, in run_sft
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2122, in train
[rank0]: return inner_training_loop(
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/mnt/zj-gpfs/home/sz/anaconda3/envs/flas/lib/python3.11/site-packages/transformers/trainer.py", line 2453, in _inner_training_loop
[rank0]: self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: a Tensor with 8 elements cannot be converted to Scalar
Looking at git blame, the offending code was introduced in #34198.
The original code is:

    self.state.num_input_tokens_seen += torch.sum(
        self.accelerator.gather(
            torch.tensor(
                inputs[main_input_name].numel(), device=self.args.device, dtype=torch.int64
            )
        )
    )

The modified code is:

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    self.state.num_input_tokens_seen += self.accelerator.gather(input_tokens).cpu().item()

I'm not an expert on this (I asked GPT about it), but the two versions appear to behave the same in a single-GPU environment. In a multi-GPU environment, the first version correctly accumulates the input token counts across all devices, whereas the second one errors out (or would, at best, only count the tokens on the current device). They are therefore not equivalent in distributed training, and the first version should be used to ensure correctness.
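A minimal standalone sketch of the failure (no real distributed setup; the per-rank counts below are made-up stand-ins for what accelerator.gather(input_tokens) would return on 8 GPUs):

    import torch

    # Stand-in for the result of accelerator.gather(input_tokens) across 8 ranks:
    # an int64 tensor with one token count per process (values are hypothetical).
    gathered = torch.tensor([512, 498, 530, 512, 505, 520, 499, 511], dtype=torch.int64)

    # 4.46.0 code path: .item() on a multi-element tensor raises
    # "RuntimeError: a Tensor with 8 elements cannot be converted to Scalar".
    try:
        gathered.cpu().item()
    except RuntimeError as e:
        print(e)

    # Pre-4.46 code path: reduce across ranks first, then convert to a scalar.
    total = torch.sum(gathered).cpu().item()
    print(total)  # 4087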

I tried reverting the code locally and rebuilding transformers, and the problem seems to be solved. Would love to hear what you think!

Expected behavior

Fix this bug and release a new version as soon as possible.

@Tender-Su Tender-Su added the bug label Oct 30, 2024
@muellerzr
Contributor

Can you please give a full reproducer? Also try building off of pip install git+https://github.com/huggingface/transformers@muellerzr-final-gradaccum-check

@techkang
Contributor

self.accelerator.gather(input_tokens) returns a tensor with one element per process, which cannot be directly added to self.state.num_input_tokens_seen (an int) via .item(). A simple fix is to wrap the gathered tensor in torch.sum(...) before converting it to a scalar.
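In code terms, the suggested fix would look roughly like this (a sketch combining the snippets above; not necessarily the exact change that landed in #34554):

    input_tokens = inputs[main_input_name].numel()
    input_tokens = torch.tensor(input_tokens, device=self.args.device, dtype=torch.int64)
    # Sum the per-process counts before converting to a Python scalar.
    self.state.num_input_tokens_seen += torch.sum(self.accelerator.gather(input_tokens)).cpu().item()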

@Tender-Su
Author

self.accelerator.gather(input_tokens) returns a tensor with one element per process, which cannot be directly added to self.state.num_input_tokens_seen (an int) via .item(). A simple fix is to wrap the gathered tensor in torch.sum(...) before converting it to a scalar.

This worked, thank you.

@ByronHsu
Contributor

+1. I think it was introduced by #34198. @muellerzr you can repro via https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface by using the latest transformers version instead of the currently hardcoded one.
