I noticed that your implementation of QK-Norm differs from the original paper, where normalization is applied after the q and k states are split into heads. What is the rationale for applying normalization before head splitting in your implementation? Thanks!
There is no special reason for that. It is just something we implemented long ago and did not pay much attention to afterwards. But I agree that applying the norm after head splitting looks more reasonable, and we may switch to that in our future work.
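For anyone reading this thread later, here is a minimal sketch of the difference between the two orderings. It is not the code from this repository or from the paper: it assumes an RMS-style normalization with no learned scale, a single token's flat q vector, and hypothetical helper names (`rms_norm`, `split_heads`, `norm_before_split`, `norm_after_split`) chosen only for illustration.

```python
import math

def rms(vec):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))

def rms_norm(vec, eps=1e-6):
    """Scale vec to roughly unit RMS (no learned gain; an assumption for this sketch)."""
    denom = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / denom for x in vec]

def split_heads(vec, num_heads):
    """Split a flat hidden vector into num_heads equal-sized chunks."""
    head_dim = len(vec) // num_heads
    return [vec[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]

def norm_before_split(vec, num_heads):
    """Normalize over the full hidden dimension, then split into heads."""
    return split_heads(rms_norm(vec), num_heads)

def norm_after_split(vec, num_heads):
    """Split into heads first, then normalize each head independently."""
    return [rms_norm(head) for head in split_heads(vec, num_heads)]

# A toy q vector whose two heads have very different magnitudes.
q = [0.1, -0.2, 0.1, 0.2, 4.0, -3.0, 5.0, 2.0]

before = norm_before_split(q, num_heads=2)
after = norm_after_split(q, num_heads=2)

for i, (b, a) in enumerate(zip(before, after)):
    print(f"head {i}: rms before-split norm = {rms(b):.3f}, after-split norm = {rms(a):.3f}")
```

With normalization before the split, the statistics are shared across all heads, so a head with small activations stays small relative to the others. With normalization after the split, each head is rescaled on its own, so every head's q (and likewise k) ends up with a comparable scale going into the attention logits.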