Hi, thanks for reproducing Differential Transformer. There seem to be some problems in your reproduction code: you should split q and k along the n_head dimension, re-parameterize lambda, and add GN with gamma. You can refer to the official code (https://github.com/microsoft/unilm/blob/master/Diff-Transformer/multihead_diffattn.py) for details.
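For readers following the thread, the three points above can be sketched roughly as follows. This is a minimal pure-Python illustration for a single head and a single query position, not the official implementation. All function names here (`lambda_init`, `reparam_lambda`, `diff_attention_row`, `rms_norm`) are hypothetical, the 1-based layer index follows the formula in the paper, and using an RMS-style norm with a learnable `gamma` as the per-head normalization is an assumption about what "GN with gamma" refers to.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lambda_init(layer_index):
    # Layer-dependent initial value; layer_index is assumed to be 1-based.
    return 0.8 - 0.6 * math.exp(-0.3 * (layer_index - 1))

def reparam_lambda(lq1, lk1, lq2, lk2, lam_init):
    # Re-parameterization: lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init,
    # where lq1, lk1, lq2, lk2 are learnable vectors of size head_dim.
    return math.exp(dot(lq1, lk1)) - math.exp(dot(lq2, lk2)) + lam_init

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def diff_attention_row(q1, q2, keys1, keys2, values, lam):
    # q and k are split into two halves per head: (q1, keys1) and (q2, keys2).
    # The second softmax map is subtracted from the first, scaled by lambda.
    scale = 1.0 / math.sqrt(len(q1))
    a1 = softmax([scale * dot(q1, k) for k in keys1])
    a2 = softmax([scale * dot(q2, k) for k in keys2])
    weights = [x - lam * y for x, y in zip(a1, a2)]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def rms_norm(x, gamma, eps=1e-5):
    # Per-head normalization with a learnable gain (assumed RMS-style here).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def diff_head_output(q1, q2, keys1, keys2, values, lam, lam_init, gamma):
    # Normalize each head independently, then rescale by (1 - lambda_init).
    out = diff_attention_row(q1, q2, keys1, keys2, values, lam)
    return [(1.0 - lam_init) * v for v in rms_norm(out, gamma)]
```

With all four lambda vectors at zero, `reparam_lambda` returns exactly `lam_init`, which makes it easy to check that a port starts from the intended initial value before training.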
You have a point, but the goal of this implementation is not to reproduce the official code. It aims to implement the core components of the architecture for compute-constrained, educational purposes. I went through the official code (and cited it in the notebook), and I will update the code later to include some of the original details. Thank you for pointing this out.