
Conversation

@www-spam

Summary

This PR fixes the issue where merge_and_unload() produces broken models when adapters are applied to both embed_tokens and lm_head on models with tie_word_embeddings=True.

Resolves #2777

Problem

When a base model has tie_word_embeddings=True (e.g., Gemma, Llama):

  1. embed_tokens and lm_head share the same weight tensor
  2. Adapters can be applied to both layers (via modules_to_save or target_modules)
  3. After training, each layer has different adapter deltas
  4. merge_and_unload() merges both layers with their respective deltas
  5. Bug: The config still has tie_word_embeddings=True
  6. When the merged model is loaded with AutoModelForCausalLM.from_pretrained(), the lm_head weights are overwritten by the embed_tokens weights due to weight tying (see the sketch below)
  7. Result: The merged lm_head weights are lost, causing degraded or garbage output
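To make step 6 concrete, the tying can be observed directly. This is a minimal check (the checkpoint name is only an example of a model with tie_word_embeddings=True; get_input_embeddings()/get_output_embeddings() are standard transformers accessors):

from transformers import AutoModelForCausalLM

# Any checkpoint with tie_word_embeddings=True behaves the same way
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
embed_w = model.get_input_embeddings().weight
head_w = model.get_output_embeddings().weight
print(embed_w.data_ptr() == head_w.data_ptr())  # True: one tensor, two names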

Solution

This PR modifies _unload_and_optionally_merge() in BaseTuner to:

  1. Detect whether both the embedding-like and lm_head-like modules have adapters
  2. Untie the weights by cloning lm_head.weight before merging
  3. Update config.tie_word_embeddings = False in all relevant config locations (sketched below)

This ensures that:

  • Merged weights are preserved for both layers
  • The saved model can be loaded correctly
  • Backward compatibility is maintained (no change when embeddings aren't both targeted)
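A minimal sketch of the idea (this is not the PR's actual code; the helper name and the hard-coded module names are illustrative, and the real logic lives in _unload_and_optionally_merge() and the helpers listed under Changes):

import torch

def untie_for_merge(model, peft_config):
    # Illustrative detection: both tied layers are targeted by the adapter config
    # (assumes list-style target_modules/modules_to_save)
    targeted = set(peft_config.target_modules or []) | set(peft_config.modules_to_save or [])
    if not {"embed_tokens", "lm_head"} <= targeted:
        return  # neither or only one tied layer is targeted: behavior is unchanged

    embed = model.get_input_embeddings()
    lm_head = model.get_output_embeddings()
    if lm_head.weight.data_ptr() == embed.weight.data_ptr():
        # Give lm_head its own storage so its merged delta survives save/reload
        lm_head.weight = torch.nn.Parameter(lm_head.weight.clone())

    # Record the untying so from_pretrained() does not re-tie on reload
    model.config.tie_word_embeddings = False

The actual PR also covers the modules_to_save path (where ModulesToSaveWrapper already unties the weights) and updates the flag in all relevant config locations; the sketch only shows the core mechanism.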

Changes

  • src/peft/tuners/tuners_utils.py:
    • Added _untie_embedding_weights() helper method
    • Added _update_tie_word_embeddings_config() helper method
    • Added _has_adapters_on_both_embeddings() helper method
    • Modified _unload_and_optionally_merge() to auto-handle tied embeddings
  • tests/test_tie_word_embeddings_merge.py:
    • Added tests for tie_word_embeddings merge behavior

Test Plan

  • Tested with Gemma 3 4B model (tie_word_embeddings=True)
  • Verified merged model produces coherent output
  • Verified config.tie_word_embeddings is correctly set to False
  • Verified embed_tokens and lm_head have independent weights after merge
  • Unit tests added (a sketch of the kind of check appears below)
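A sketch of the kind of check involved (the actual tests live in tests/test_tie_word_embeddings_merge.py and may be structured differently):

def test_merge_unties_tied_embeddings(peft_model):
    # peft_model: a PeftModel whose base model has tie_word_embeddings=True and
    # whose adapters target both embed_tokens and lm_head
    merged = peft_model.merge_and_unload()
    embed_w = merged.get_input_embeddings().weight
    head_w = merged.get_output_embeddings().weight
    assert embed_w.data_ptr() != head_w.data_ptr(), "weights must be independent after merge"
    assert merged.config.tie_word_embeddings is False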

Example

Before this fix:

# Model with tie_word_embeddings=True + adapters on embed_tokens and lm_head
merged = peft_model.merge_and_unload()
merged.save_pretrained("merged_model")

# Loading produces a broken model: weight tying overwrites the merged lm_head
loaded = AutoModelForCausalLM.from_pretrained("merged_model")  # lm_head weights lost!

After this fix:

# Same setup
merged = peft_model.merge_and_unload()  # Auto-unties and updates config
merged.save_pretrained("merged_model")

# Loading works correctly
loaded = AutoModelForCausalLM.from_pretrained("merged_model")  # Works as expected


When LoRA is applied to both embed_tokens and lm_head on models with
tie_word_embeddings=True, merge_and_unload() now automatically:
- Detects if both layers have adapters
- Unties the weights before merging
- Sets config.tie_word_embeddings=False

This prevents the merged lm_head weights from being lost when the
model is reloaded.

Resolves huggingface#2777
www-spam closed this Dec 31, 2025
www-spam reopened this Dec 31, 2025
- Check both target_modules and modules_to_save when detecting adapters
  on embed_tokens and lm_head
- Always update config when adapters are on both layers (ModulesToSaveWrapper
  already unties weights, so we just need to update config)
- Update warning message for clarity
@romitjain
Contributor

@www-spam
I'm curious whether you have tried the flag proposed in the issue's solution (ensure_weight_tying=True in LoraConfig).
It was added via PR #2803 and should be available in the latest release. Let me know if that solves the issue for you.

@www-spam
Author

www-spam commented Jan 5, 2026

I tested ensure_weight_tying=True, but it doesn't apply to this case for two main reasons:

1. Technical Limitation: It only applies to modules_to_save

ensure_weight_tying is designed for modules_to_save, not target_modules.

According to lora/config.py:

ensure_weight_tying: bool = field(
    ...
    metadata={
        "help": (
            "...This is only applicable for layers passed via "
            "\`modules_to_save\`."
        )
    },
)

When used with target_modules=["embed_tokens", "lm_head"], it triggers this warning and has no effect:

UserWarning: You have requested `ensure_weight_tying`, but no tied modules are added in `modules_to_save`
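For reference, a configuration along these lines reproduces the warning (a sketch; rank and other LoRA hyperparameters are omitted):

from peft import LoraConfig

lora_config = LoraConfig(
    target_modules=["embed_tokens", "lm_head"],
    ensure_weight_tying=True,  # only honored for modules_to_save, so it warns and is ignored here
)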

2. Conceptual Limitation: Independent training is required

Even if it worked for target_modules, ensure_weight_tying=True forces adapters to share identical weights. This breaks use cases that require independent training, such as:

  • Custom token learning: New tokens often require different input vs. output embeddings.
  • Asymmetric fine-tuning: embed_tokens and lm_head may benefit from learning different deltas during optimization.

Conclusion

This PR handles the target_modules case by automatically setting config.tie_word_embeddings=False on merge. This ensures that the distinct weights learned for input and output are correctly preserved upon reload.

@romitjain
Contributor

@www-spam Re:

  1. Yes, that makes sense. This is being worked on here: ENH: Tie weights for target_modules in Lora (#2864) #2879. After that merges, ensure_weight_tying should work for target_modules too.
  2. I have a different opinion on this: if the model being tuned has tied embeddings, ensure_weight_tying=True just makes sure that the model architecture does not break. For both custom token learning and asymmetric fine-tuning, a better approach might be to modify the config of the model being tuned to break the tied embeddings, i.e., set config.tie_word_embeddings=False before tuning the model.

I don't think this should be done by default (setting config.tie_word_embeddings=False on merge), because downstream tasks might still assume the original model's config.

WDYT?

I think @BenjaminBossan might have some views on this too.

@BenjaminBossan
Member

Thanks for opening the PR, @www-spam, and for the discussion, @romitjain.

Regarding point 2, I tend to agree with Romit. Implicitly untying the weights here could be surprising for users. Since merge_and_unload would be the last step in a possibly very long training process, this could result in a lot of lost time. Yes, we could argue it's the user's fault in that case, but it's not particularly user-friendly. I see that there can be a legitimate need for these use cases, but I would agree it's better if the user makes this decision ahead of time and explicitly.

What we could ensure on the PEFT side is that if the user targets the tied layers with, say, LoRA, they get a warning that this means that merging and unloading won't work properly. WDYT?

@www-spam
Author

@romitjain @BenjaminBossan

Thanks for pointing to #2879. I reviewed it and I think our PRs address different scenarios.

Different goals:

#2879 extends ensure_weight_tying to target_modules, which keeps LoRA adapters tied together — both embed_tokens and lm_head share the same delta. This is useful when you want to preserve the original tied architecture during fine-tuning.

My use case requires the opposite: independent deltas for embed_tokens and lm_head. When adding domain-specific tokens to the vocabulary, input embeddings need to learn "what this token means" while output embeddings learn "when to generate this token." These diverge during training, and that's intentional.

What actually happened:

I did follow the recommended approach — I set tie_word_embeddings=False before training:

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(model_path)  # model_path: the base checkpoint
config.tie_word_embeddings = False
model = AutoModelForCausalLM.from_pretrained(model_path, config=config)
# Train with LoRA on embed_tokens and lm_head...

Training worked fine. The problem is after merge_and_unload():

  1. Training with tie_word_embeddings=False — works correctly
  2. merge_and_unload() — completes without error
  3. save_pretrained() — saved config doesn't reflect the training config
  4. from_pretrained() on reload — uses base model's tie_word_embeddings=True
  5. Weights get re-tied, lm_head is overwritten with embed_tokens → model outputs garbage

On the config concern:

I understand the concern about downstream compatibility. But consider this: if someone trained with independent embed_tokens/lm_head, the merged model is architecturally different from the base model. The config should reflect that.

Keeping tie_word_embeddings=True when the weights have actually diverged causes silent model corruption on reload. Updating the config to match the actual model state seems like the safer default.

How these PRs relate:

| User intent | Solution |
| --- | --- |
| Keep adapters tied during training | #2879 (ensure_weight_tying=True) |
| Train independent adapters, preserve them on merge | This PR |

I see them as complementary. Happy to discuss alternative approaches if you have other ideas.

@romitjain
Contributor

@www-spam Agree on the use case.
I am just adding my thoughts as a PEFT user. I think a better solution might be to preserve the model config when saving the model.

Specifically,

save_pretrained() — saved config doesn't reflect the training config

The config that the user provided should be saved as-is: if the user sets model.config.tie_word_embeddings = False, the final saved config should also reflect that.
This way, downstream tasks will behave as expected, in accordance with your suggestion.

I think you are already doing this in your implementation. IMO, what should not happen is that adding adapters to both of the tied layers automatically breaks the tying, irrespective of whether the user set ensure_weight_tying=True.
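For example, the explicit version of that is a one-line change the user can make today before saving (a sketch reusing the names from the earlier examples; it assumes the in-memory weights are already untied because the model was loaded with tie_word_embeddings=False):

merged = peft_model.merge_and_unload()
merged.config.tie_word_embeddings = False  # make the saved config match the untied weights
merged.save_pretrained("merged_model")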
