System Info
- transformers version: 4.54.1
- Platform: Linux-5.15.0-131-generic-x86_64-with-glibc2.39
- Python version: 3.12.3
- Huggingface_hub version: 0.34.3
- Safetensors version: 0.5.3
- Accelerate version: 1.10.0
- Accelerate config: not found
- DeepSpeed version: 0.17.4
- PyTorch version (accelerator?): 2.8.0a0+5228986c39.nv25.06 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA H100 80GB HBM3
Who can help?
Previous PRs: #35207 and #34511.
These changes cause backward() to be called after the loss has already been rescaled, which produces a double rescaling: once here in the Trainer and once again in Accelerate.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When:
- gradient_accumulation_steps > 1
- not using deepspeed
- `num_items_in_batch` is None and `self.compute_loss_func` is None (i.e., when the user does not opt into the GA loss fix)
The final loss is rescaled twice:

```python
loss = loss / gradient_accumulation_steps
```
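The effect can be shown with a minimal numeric sketch in plain Python (no Transformers or Accelerate imports; the variable names are illustrative, not the libraries' actual internals). If both the Trainer and Accelerate divide by `gradient_accumulation_steps`, the loss ends up scaled by `1 / ga_steps**2` instead of `1 / ga_steps`:

```python
# Hypothetical illustration of the double rescaling, assuming
# gradient_accumulation_steps > 1 and no DeepSpeed.
gradient_accumulation_steps = 4
raw_loss = 8.0

# First rescale: the Trainer divides the loss before calling backward().
trainer_loss = raw_loss / gradient_accumulation_steps

# Second rescale: Accelerate divides by the same factor again inside
# its gradient-accumulation handling when backward() is called.
accelerate_loss = trainer_loss / gradient_accumulation_steps

correct_once = raw_loss / gradient_accumulation_steps

print(trainer_loss)      # loss after the Trainer's rescale
print(accelerate_loss)   # loss actually backpropagated: divided twice
print(correct_once)      # what a single rescale would give
```

With `gradient_accumulation_steps = 4`, the gradients end up 4x smaller than intended, which silently changes the effective learning rate.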
Expected behavior
It should be rescaled only once.