
Unexpected behaviour with transformers versions above 4.28 for Donut #39473

@mdavudov

System Info

Hello,

Big thanks to all the contributors on this repo!

I would like to raise an issue that was initially encountered when running the example notebooks for Donut in Transformers-Tutorials (https://github.com/NielsRogge/Transformers-Tutorials) by @NielsRogge. The issue was previously raised on that repo, but the author advised re-raising it here. Original issue: NielsRogge/Transformers-Tutorials#496 (comment)

Bug:

The bug was encountered when trying to reproduce results from this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

When using newer versions of transformers there is strange behaviour during training: the model shows much higher validation edit distance values than expected. This is fixed by downgrading to version 4.28.1 or 4.25.
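
For context, the validation metric in the notebook is a normalized edit distance between the decoded prediction and the ground-truth target sequence. Below is a minimal sketch of that computation; it assumes `nltk`'s `edit_distance` and hypothetical `prediction`/`answer` strings, and may differ in detail from the notebook's exact code.

```python
# Minimal sketch of the normalized edit distance used as the validation metric.
# `prediction` and `answer` are hypothetical decoded strings, not notebook variables.
from nltk import edit_distance

def normalized_edit_distance(prediction: str, answer: str) -> float:
    # Normalize by the longer string so the score lies in [0, 1]:
    # 0.0 is an exact match, values near 1.0 mean the strings barely overlap.
    return edit_distance(prediction, answer) / max(len(prediction), len(answer))

print(normalized_edit_distance("<s_total>12.50</s_total>", "<s_total>12.50</s_total>"))  # 0.0
```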

Reference code uses the following classes from transformers (a rough loading sketch follows this list):

  • DonutProcessor
  • VisionEncoderDecoderModel
  • VisionEncoderDecoderConfig
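
As a rough illustration of how these three classes are wired together for Donut fine-tuning (the checkpoint name and the image-size / max-length values below are illustrative assumptions, not the notebook's exact settings):

```python
# Sketch of loading Donut with the three classes listed above.
# The checkpoint and the image_size / max_length values are illustrative only.
from transformers import DonutProcessor, VisionEncoderDecoderConfig, VisionEncoderDecoderModel

checkpoint = "naver-clova-ix/donut-base"  # assumed base checkpoint

config = VisionEncoderDecoderConfig.from_pretrained(checkpoint)
config.encoder.image_size = [1280, 960]  # example input resolution
config.decoder.max_length = 768          # example maximum target sequence length

processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint, config=config)
```

Passing the modified config into `from_pretrained` is what allows the image resolution and decoder length to differ from the pretrained defaults.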

The difference can be seen in the attached screenshot, where the red line shows the validation edit distance metric when running on 4.28.1 and the orange one when running on 4.36.0.

Were there any changes introduced after 4.28.1 that could be causing this, and are there any known ways of fixing it?

[Screenshot: validation edit distance during training, 4.28.1 (red) vs 4.36.0 (orange)]

Environment

Output of `transformers env` for 4.28.1:

- `transformers` version: 4.28.1
- Platform: Linux-6.1.134-152.225.amzn2023.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.12
- Huggingface_hub version: 0.32.4
- Safetensors version: 0.5.3
- PyTorch version (GPU?): 2.7.1+cu128 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: NO

For 4.36.0 (the version where the issue occurs):

- `transformers` version: 4.36.0
- Platform: Linux-6.1.134-152.225.amzn2023.x86_64-x86_64-with-glibc2.34
- Python version: 3.11.12
- Huggingface_hub version: 0.32.4
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.7.1+cu128 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: YES
- Using distributed or parallel set-up in script?: NO

Thank you for your time, and please let me know what I can do on my end to make it easier to diagnose the issue more precisely.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The bug was encountered when trying to reproduce results from this notebook:

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb

To reproduce:

  1. Follow the notebook as-is; this installs the latest version of transformers
  2. Continue until the training step and run the training
  3. Observe unexpectedly high validation edit distance metrics

To fix:

  1. Pin the transformers version to 4.28.1 (see the sketch after these steps)
  2. Run the notebook again
  3. You should observe a much lower validation edit distance
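
A minimal way to confirm the pin took effect before re-running training (the exact install cell in the notebook may look different):

```python
# After running `pip install "transformers==4.28.1"` and restarting the kernel,
# confirm the pinned version is the one actually being imported.
import transformers

print(transformers.__version__)
assert transformers.__version__ == "4.28.1", f"unexpected version: {transformers.__version__}"
```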

Expected behavior

I expect the training behaviour to be similar on newer versions of transformers and the performance not to degrade so drastically.
