
Conversation


@Addyk-24 Addyk-24 commented Oct 11, 2025

What does this PR do?

Fixes #41492

Fixes incorrect target-language generation during evaluation/validation in run_translation.py for multilingual translation models (mBART, M2M100).

Problem

When fine-tuning multilingual models, forced_bos_token_id was only set in model.config, not in model.generation_config. During evaluation, model.generate() reads from generation_config, so generation happened in the wrong language and BLEU scores were artificially low (roughly 2-5 before the fix).
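A minimal illustration of the precedence described above, using stand-in objects rather than the real Transformers classes (the token ID 250003 is mBART-50's de_DE language code, per the tests later in this thread):

```python
from types import SimpleNamespace

# Mock of the precedence: generation resolves forced_bos_token_id from
# generation_config, so a value set only on model.config is silently ignored.
def resolve_forced_bos(model):
    return getattr(model.generation_config, "forced_bos_token_id", None)

model = SimpleNamespace(
    config=SimpleNamespace(forced_bos_token_id=None),
    generation_config=SimpleNamespace(forced_bos_token_id=None),
)

# The buggy script only did the equivalent of this:
model.config.forced_bos_token_id = 250003  # de_DE in mBART-50

print(resolve_forced_bos(model))  # -> None: generation never sees the setting
```

This is why the fine-tuned model decodes into the wrong language even though model.config looks correct.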

Solution

Set forced_bos_token_id in both model.config and model.generation_config.
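A sketch of the change, shown on stand-in objects so it runs without downloading a checkpoint; with a real mBART-50 model you would pass the tokenizer's language-code ID (e.g. tokenizer.lang_code_to_id["de_DE"]) as the value:

```python
from types import SimpleNamespace

def set_forced_bos(model, forced_bos_token_id):
    # The fix: write the ID to BOTH configs, so that model.generate()
    # (which reads generation_config) and any code reading model.config agree.
    model.config.forced_bos_token_id = forced_bos_token_id
    model.generation_config.forced_bos_token_id = forced_bos_token_id
    return model

# Demonstrated on stand-ins; any object with .config and .generation_config works.
model = SimpleNamespace(config=SimpleNamespace(), generation_config=SimpleNamespace())
set_forced_bos(model, 250003)  # 250003 = de_DE in mBART-50
print(model.generation_config.forced_bos_token_id)  # -> 250003
```

The helper name set_forced_bos is illustrative, not part of run_translation.py.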

Results:

  • Generation now emits the correct target-language token ID, so evaluation runs in the intended language.
  • Setting the value on model.generation_config avoids the warning raised when model.config is modified for generation, and keeps the setting working in Transformers v5+.
  • All evaluations complete without errors.
  • Setting decoder_start_token_id (alone or together with forced_bos_token_id) produces empty outputs, so it is not used.
  • With this fix, the target language ID is handled automatically during generation.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@zach-huggingface @Cyrilvallez


Bmingg commented Oct 11, 2025

What worked for me is setting model.generation_config.decoder_start_token_id to the target language ID of MBart. When I discovered this bug, I believe I checked the forced_bos_token_id of MBart's output, and it should still be the start token </s>. In that case, what was missing was the target language ID after the start token, if I remember correctly.

@Addyk-24
Author

What worked for me is setting model.generation_config.decoder_start_token_id to the target language ID of MBart. When I discovered this bug, I believe I checked the forced_bos_token_id of MBart's output, and it should still be the start token </s>. In that case, what was missing was the target language ID after the start token, if I remember correctly.

Thanks for sharing your experience! I ran custom comprehensive tests to verify the correct fix, and here's what I found:

Tests performed:

I tested 4 different approaches on facebook/mbart-large-50-one-to-many-mmt, translating en_XX → de_DE:

  1. Baseline (no fix): first 5 token IDs [2, 250002, 64681, 4, 1199] → wrong target language.
  2. forced_bos_token_id set in generation_config (my current fix): first 5 token IDs [2, 250003, 54029, 4, 1225] → correct language token ID, correct target language.
  3. decoder_start_token_id only: token IDs [250003, 2] → generated "" (empty/broken output).
  4. Both set: token IDs [250003, 250003, 2] → generated "" (broken: duplicated language token).
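The four cases above reduce to one structural check: a well-formed mBART-50 decoder sequence starts with </s> (ID 2) followed by the target language ID (250003 for de_DE). A small sketch of that check, using the token IDs reported above (the helper name is hypothetical):

```python
EOS_ID = 2        # mBART-50 decoder start token </s>
DE_DE_ID = 250003 # de_DE language code in mBART-50

def starts_correctly(token_ids, eos_id=EOS_ID, lang_id=DE_DE_ID):
    # A valid mBART-50 generation begins [</s>, <lang_id>, ...].
    return len(token_ids) >= 2 and token_ids[0] == eos_id and token_ids[1] == lang_id

print(starts_correctly([2, 250003, 54029, 4, 1225]))  # fix applied -> True
print(starts_correctly([2, 250002, 64681, 4, 1199]))  # baseline, wrong language -> False
print(starts_correctly([250003, 2]))                   # decoder_start_token_id only -> False
print(starts_correctly([250003, 250003, 2]))           # both set -> False
```

Only the approach that sets forced_bos_token_id in generation_config produces the expected prefix.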

Below is the fix I applied:

[screenshot of the diff]

Results:

  • forced_bos_token_id controls the token generated right after the decoder start token (</s> for mBART), which is where the target language ID belongs.
  • Setting decoder_start_token_id to the language ID instead breaks the generation flow.

Conclusion

  • Based on these tests, this PR ensures correct target-language generation for mBART by setting generation_config.forced_bos_token_id.
  • Using decoder_start_token_id instead causes invalid or empty outputs.
  • With this fix, the target language ID is handled automatically during generation.

@Cyrilvallez
Member

@Addyk-24 do you mind reverting all unrelated changes please? 🤗 I.e. all style changes (newlines etc) so that we can see the clear diff

@Addyk-24 Addyk-24 closed this Oct 13, 2025
@Addyk-24 Addyk-24 force-pushed the fix/model_generation_config_fix branch from 1198033 to 3927ffe Compare October 13, 2025 16:23
@Addyk-24 Addyk-24 reopened this Oct 13, 2025
Author

Addyk-24 commented Oct 13, 2025

@Addyk-24 do you mind reverting all unrelated changes please? 🤗 I.e. all style changes (newlines etc) so that we can see the clear diff

@Cyrilvallez Done! I've reverted all unrelated formatting changes. The PR now only includes the fix for setting forced_bos_token_id in generation_config, so the language ID is handled automatically instead of manually. I've also performed 4 custom tests to verify this behavior. Please let me know if any further adjustments are needed. Thanks 🤗


Successfully merging this pull request may close these issues.

For finetuning MBart-based model, setting decoder_start_token_id in model.config is NOT ENOUGH.
