From the paper, Eq. 2 gives the Chinchilla compute calculation. The first term comes from the $6ND$ estimate for non-embedding FLOPs (it excludes the lm_head parameters as well, perhaps because of the tied embeddings), but the second term is not what Chinchilla used to calculate embedding FLOPs: see Appendix F of the Chinchilla paper, where the total forward-pass FLOPs include both the embedding and the logits calculations.
So, should $N_2$ be larger than what is used in the paper (i.e., the second term doubled)?
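For concreteness, here is a minimal sketch of the two accountings being contrasted above. The symbols and model shape (`n_layer`, `d_model`, `n_vocab`) are illustrative assumptions, not values from the paper; the only point is that an Appendix-F-style count charges the vocab-sized matmul twice per token (input embeddings plus output logits), whereas charging 6 FLOPs per tied-embedding parameter counts it once.

```python
# Rough per-token training-FLOP accountings (sketch; all symbols are assumptions).

def non_embedding_flops_per_token(n_layer: int, d_model: int) -> int:
    """~6 FLOPs per non-embedding parameter per token, i.e. 72 * n_layer * d_model^2
    (attention-over-sequence terms ignored here)."""
    return 72 * n_layer * d_model ** 2

def vocab_flops_per_token_once(n_vocab: int, d_model: int) -> int:
    """6 FLOPs per tied-embedding parameter per token: the vocab-sized matmul
    is charged a single time."""
    return 6 * n_vocab * d_model

def vocab_flops_per_token_appendix_f(n_vocab: int, d_model: int) -> int:
    """Chinchilla Appendix F style: the forward pass includes BOTH the input
    embeddings and the final logits (2 * n_vocab * d_model FLOPs each per token);
    total FLOPs are taken as 3x forward, giving 12 * n_vocab * d_model per token."""
    forward = 2 * n_vocab * d_model + 2 * n_vocab * d_model
    return 3 * forward

if __name__ == "__main__":
    n_layer, d_model, n_vocab = 32, 4096, 100_000  # made-up shape
    print(non_embedding_flops_per_token(n_layer, d_model))
    print(vocab_flops_per_token_once(n_vocab, d_model))        # counted once
    print(vocab_flops_per_token_appendix_f(n_vocab, d_model))  # exactly double
```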
yzlnew changed the title from "Is the model scale representation identical to Chinchilla?" to "Is the compute calculation wrong for Chinchilla in the paper?" on Apr 30, 2024
Starting from Table 3, I re-added the missing part to $N_2$ and recalculated $6N_2/M$. The results are [2.19, 1.50, 1.01, 1.01, 0.97, 0.95, 0.95], which is much closer to 1.0 for the larger models.
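A minimal sketch of that recalculation, assuming $6N_2$ has the form $72\,n_{layer} d_{model}^2 + 6\,n_{vocab} d_{model}$ and taking $M$ (non-embedding FLOPs per token) as an input; the model shape and $M$ value below are placeholders, not the numbers from Table 3:

```python
# Sketch of the 6*N_2 / M check; the shape and M below are placeholders, not Table 3 values.

def six_n2_over_m(n_layer: int, d_model: int, n_vocab: int, m_per_token: float,
                  double_vocab_term: bool = True) -> float:
    """Ratio of the 6*N_2 compute estimate to M (non-embedding FLOPs per token)."""
    six_n2 = 72 * n_layer * d_model ** 2 + 6 * n_vocab * d_model
    if double_vocab_term:
        six_n2 += 6 * n_vocab * d_model  # re-add the missing embedding/logits half
    return six_n2 / m_per_token

# Example with made-up numbers; the vocab-dependent term matters most for small models,
# which is why the smallest entries in the list above stay well above 1.0.
print(round(six_n2_over_m(n_layer=12, d_model=768, n_vocab=100_000, m_per_token=6e8), 2))
```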
Could you share the data of Figure 6 for further study? Thanks!