Question about fixed std=0.02 initialization of `w1` in moe.py #1257
Labels: question (Further information is requested)
Hi torchtitan team,
Thanks for the great work on this project! I had a question regarding a detail in the code at moe.py#L92:
torchtitan/torchtitan/experiments/llama4/model/moe.py, line 92 in 768cde1
I noticed that `w1` is initialized with a fixed standard deviation of 0.02, whereas `w2` and `w3` are initialized using a configurable `init_std` parameter. I'm wondering if this discrepancy is intentional, and if so, what the reasoning is behind using a hardcoded value for `w1`.
Would greatly appreciate any insights you could share!
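To make the question concrete, here is a minimal sketch of the pattern being described. It is not a copy of the code at line 92: the class name `ExpertsSketch`, the tensor shapes, and the choice of `nn.init.trunc_normal_` as the initializer are assumptions for illustration. Only the fixed 0.02 for `w1` and the configurable `init_std` for `w2` and `w3` come from the observation above.

```python
import torch
import torch.nn as nn


class ExpertsSketch(nn.Module):
    """Hypothetical stand-in for an experts module; shapes are illustrative."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int):
        super().__init__()
        self.w1 = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))
        self.w2 = nn.Parameter(torch.empty(num_experts, hidden_dim, dim))
        self.w3 = nn.Parameter(torch.empty(num_experts, dim, hidden_dim))

    def init_weights(self, init_std: float) -> None:
        # w1 ignores init_std and always uses 0.02 (the behavior in question).
        nn.init.trunc_normal_(self.w1, mean=0.0, std=0.02)
        # w2 and w3 follow the caller-supplied init_std.
        nn.init.trunc_normal_(self.w2, mean=0.0, std=init_std)
        nn.init.trunc_normal_(self.w3, mean=0.0, std=init_std)


torch.manual_seed(0)
experts = ExpertsSketch(dim=64, hidden_dim=128, num_experts=4)
experts.init_weights(init_std=0.005)

# Empirical standard deviation of each parameter after initialization.
stds = {name: p.std().item() for name, p in experts.named_parameters()}
print(stds)
```

With any `init_std` other than 0.02, the printed standard deviation of `w1` differs from that of `w2` and `w3`, which is the discrepancy I am asking about.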
Thanks again!