Question about fixed std=0.02 initialization of `w1` in `moe.py` #1257

trestad · 2025-06-03T04:06:53Z

Hi torchtitan team,

Thanks for the great work on this project! I had a question regarding a detail in the code at moe.py#L92 (torchtitan/torchtitan/experiments/llama4/model/moe.py, line 92 in 768cde1):

nn.init.trunc_normal_(self.w1, mean=0.0, std=0.02)

I noticed that w1 is initialized with a fixed standard deviation of 0.02, whereas w2 and w3 are initialized using a configurable init_std parameter. I’m wondering if this discrepancy is intentional, and if so, what the reasoning is behind using a hardcoded value for w1.
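To make the discrepancy concrete, here is a minimal sketch of the pattern I mean. Only the `w1` call is copied from the line quoted above; the class name `ToyExperts`, the tensor shapes, and the `init_weights` signature are my own assumptions for illustration and are not taken from `moe.py`.

```python
import torch
import torch.nn as nn


class ToyExperts(nn.Module):
    """Hypothetical stand-in for the expert weights; not the torchtitan class."""

    def __init__(self, num_experts: int, dim: int, hidden_dim: int):
        super().__init__()
        # Shapes are assumed for illustration only.
        self.w1 = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))
        self.w2 = nn.Parameter(torch.empty(num_experts, hidden_dim, dim))
        self.w3 = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))

    def init_weights(self, init_std: float) -> None:
        # w1: hardcoded std, as on the quoted line.
        nn.init.trunc_normal_(self.w1, mean=0.0, std=0.02)
        # w2 and w3: std comes from the configurable init_std argument.
        nn.init.trunc_normal_(self.w2, mean=0.0, std=init_std)
        nn.init.trunc_normal_(self.w3, mean=0.0, std=init_std)


if __name__ == "__main__":
    torch.manual_seed(0)
    experts = ToyExperts(num_experts=4, dim=64, hidden_dim=128)
    experts.init_weights(init_std=0.005)
    # w1 stays near 0.02 no matter what init_std is; w2 and w3 follow init_std.
    for name in ("w1", "w2", "w3"):
        print(name, getattr(experts, name).std().item())
```

With this setup, changing `init_std` affects the spread of `w2` and `w3` but leaves `w1` untouched, which is the behavior I am asking about.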

Would greatly appreciate any insights you could share!

Thanks again!


tianyu-l · 2025-06-03T08:22:59Z

I copied this from the Llama 3 FFN init code at https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama3/model.py#L305

@lessw2020 do you have more context on the choice of std for w1, w2, w3?

tianyu-l added the `question` label (Further information is requested) on Jun 3, 2025
