
Conversation

@Cyrilvallez (Member) commented Nov 13, 2025

What does this PR do?

Follow-up to #41580.
This PR is focused on speed and efficiency, as well as clarity. It avoids the following issues/bottlenecks:

  • No longer needs to dynamically switch the class of the Parameters (which is hard to understand and may lead to issues with quantization etc., as it's not always clear whether the memory stays the same or is doubled)
  • Avoids an extra loop over all the modules just to switch classes
  • Avoids initializing the tied weights, which takes a long time for big embeddings (e.g. for "google/gemma-2-2b" the embedding has size 256k * 2304, i.e. more than 1GB of data in float16 in a single parameter, and it takes 3-4s to initialize it with normal_, see profiling below), when we overwrite them anyway later with the tied weights
  • Avoids an extra loop over all modules for tied weights (the correct names are gathered in advance)
[Screenshot: profiling trace of the from_pretrained call]

On that trace, the call to Parameter.normal_ takes 3.4s and is only for the lm_head (the only "missing" weight), even though it's not truly missing: it's a tied weight that gets overwritten later!
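The idea can be sketched as follows; this is a minimal illustration and not the actual loading code (the toy model, the TIED_KEYS mapping and both helper functions are made up for the example): if the names of the tied parameters are known in advance, the random init simply skips them, since tying the weights overwrites them anyway.

import torch.nn as nn

# Toy stand-in for a causal LM whose lm_head is tied to the embedding.
class TinyLM(nn.Module):
    def __init__(self, vocab_size=1000, hidden_size=16):  # gemma-2-2b would be 256k * 2304
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Hypothetical mapping, known in advance: tied param name -> source param name.
TIED_KEYS = {"lm_head.weight": "embed_tokens.weight"}

def init_missing_weights(model, missing_keys):
    # Only initialize weights that are truly missing from the checkpoint.
    # Tied parameters are skipped: running normal_ on a 256k * 2304 matrix
    # takes seconds, and the result is thrown away once the weights are tied.
    for name, param in model.named_parameters():
        if name in missing_keys and name not in TIED_KEYS:
            nn.init.normal_(param, mean=0.0, std=0.02)

def tie_weights(model):
    # Point each tied parameter at its source parameter.
    for target, source in TIED_KEYS.items():
        module_path, _, attr = target.rpartition(".")
        setattr(model.get_submodule(module_path), attr, model.get_parameter(source))

model = TinyLM()
init_missing_weights(model, missing_keys={"lm_head.weight"})  # nothing to init: it is tied
tie_weights(model)
assert model.lm_head.weight is model.embed_tokens.weight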

So basically, the following snippet

from transformers import AutoModelForCausalLM
import torch
import time

model_id = "google/gemma-2-2b"
device = 1  # target GPU index

# Synchronize the target device so the timing covers the full load.
torch.cuda.synchronize(device)
t0 = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, dtype="auto")
torch.cuda.synchronize(device)
dt = time.time() - t0
print(f"Took {dt:.2f} s")

takes about 7s, whereas it was taking about 3s on main before #41580 (i.e. with the old loading). After this PR, it takes about 3s as well, effectively being as performant as before.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Very nice, make sure to rebase! Also, models with a specific weight init are, I think, like

elif isinstance(module, RTDetrV2MultiscaleDeformableAttention):

this one!

@Cyrilvallez Cyrilvallez changed the title Much more efficient and clear weight initialization Much more efficient and clear weight initialization and tie weights Nov 14, 2025
@ArthurZucker (Collaborator) left a comment

In general good; you potentially need to overwrite the getter and setter for tie_word_embeddings to update _tied_weights_keys.

Usually, all models use the init from `transformers`, which is already guarded, but just to make extra sure,
and for remote code, we also use this context manager.
"""
Collaborator:

This won't work for direct tensor manipulation in remote code / code outside our scope, but it's fine.

Member (Author):

Yes I know, this is very unfortunate but we cannot really make it work for remote code 🥲
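For context, such an init guard can be sketched as a context manager that temporarily swaps the in-place torch.nn.init functions for no-ops, so anything that initializes weights through them becomes free. The sketch below is an illustration under that assumption (the name and function list are made up; the actual transformers implementation may differ):

import contextlib
import torch

@contextlib.contextmanager
def no_init_weights_guard():
    # In-place init functions to neutralize (non-exhaustive list).
    names = ["normal_", "uniform_", "constant_", "trunc_normal_",
             "xavier_normal_", "xavier_uniform_",
             "kaiming_normal_", "kaiming_uniform_"]
    originals = {name: getattr(torch.nn.init, name) for name in names}
    try:
        for name in names:
            # Replace with a no-op that returns the tensor unchanged.
            setattr(torch.nn.init, name, lambda tensor, *args, **kwargs: tensor)
        yield
    finally:
        # Always restore the original functions.
        for name, fn in originals.items():
            setattr(torch.nn.init, name, fn)

Anything that writes into a parameter without going through torch.nn.init (e.g. param.data.normal_() called directly in remote code) slips past such a guard, which is exactly the limitation discussed here.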

_prefix = f"{self.base_model_prefix}."
unexpected_keys = {k.removeprefix(_prefix) for k in unexpected_keys}
# Set the flag (very important to avoid initializing them!!)
for tied_param in self._tied_weights_keys.keys():
Collaborator:

Only if tie_word_embeddings or tie_encoder_decoder.

Collaborator:

Ah ok, they only exist if you have tie_word_embeddings. But in that case you need a setter and getter.

Member (Author):

Yes, they are set correctly in advance in post_init.
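A minimal sketch of the getter/setter idea discussed above; the mixin, the property, and the concrete key mapping are assumptions for illustration (in the PR itself the mapping is set up once in post_init):

class TiedEmbeddingsMixin:
    # Hypothetical mixin keeping _tied_weights_keys in sync with the config flag.

    @property
    def tie_word_embeddings(self):
        return self.config.tie_word_embeddings

    @tie_word_embeddings.setter
    def tie_word_embeddings(self, value):
        self.config.tie_word_embeddings = value
        # Recompute the mapping so the loading / tying logic knows which
        # parameters are tied and can skip initializing them.
        if value:
            self._tied_weights_keys = {"lm_head.weight": "model.embed_tokens.weight"}
        else:
            self._tied_weights_keys = {}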

@ArthurZucker (Collaborator) left a comment

Big Bird failing IS related, let's fix it.
Also, can you update

# Unless required by applicable law or agreed to in writing, software

please.

Also add a big TODO for the dynamic part, I think it is important!

@github-actions (Contributor) commented

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, albert, align, altclip, aria, audio_spectrogram_transformer, audioflamingo3, autoformer, bamba, bark, bart, beit, bert, bert_generation, big_bird, bigbird_pegasus

@Cyrilvallez Cyrilvallez merged commit 8598421 into main Nov 14, 2025
20 of 24 checks passed
@Cyrilvallez Cyrilvallez deleted the better-init-2 branch November 14, 2025 23:34