Is the compute calculation wrong for Chinchilla in the paper?

In [the paper](https://arxiv.org/abs/2401.02954), Eq. 2 lists the Chinchilla compute calculation as

$$6N_2 = 72 n_\text{layer}d_\text{model}^2 + 6n_\text{vocab}d_\text{model}$$

The first term comes from the $6ND$ estimate applied to non-embedding FLOPs (it also excludes the lm_head parameters, maybe because of tied embeddings). The second term, however, is not how Chinchilla calculates embedding FLOPs: according to Appendix F of the Chinchilla paper, the total forward-pass FLOPs include both the embedding and the logits calculations.

So shouldn't $N_2$ be larger than what the paper uses, i.e. with the second term doubled?
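To make the size of the discrepancy concrete, here is a minimal sketch comparing the two per-token counts. It assumes the "doubled" variant is simply Eq. 2 with the vocabulary term counted twice (once for the embedding, once for the logits), which is the reading proposed in this issue and not a confirmed correction. The function names and the configuration values are hypothetical, chosen only for illustration.

```python
def flops_per_token_eq2(n_layer: int, d_model: int, n_vocab: int) -> int:
    """6 * N_2 as written in Eq. 2: the vocabulary term is counted once."""
    return 72 * n_layer * d_model**2 + 6 * n_vocab * d_model


def flops_per_token_doubled(n_layer: int, d_model: int, n_vocab: int) -> int:
    """Variant suggested in this issue: the vocabulary term is counted twice
    (embedding + logits). This is an assumption, not the paper's formula."""
    return 72 * n_layer * d_model**2 + 12 * n_vocab * d_model


# Hypothetical model configuration, for illustration only.
n_layer, d_model, n_vocab = 30, 4096, 102400

eq2 = flops_per_token_eq2(n_layer, d_model, n_vocab)
doubled = flops_per_token_doubled(n_layer, d_model, n_vocab)

print(f"6*N_2 as in Eq. 2:       {eq2:,}")
print(f"with second term doubled: {doubled:,}")
print(f"relative increase:        {(doubled - eq2) / eq2:.2%}")
```

The gap between the two counts is exactly $6n_\text{vocab}d_\text{model}$, so under this reading its relative weight grows as $n_\text{layer}d_\text{model}$ shrinks, meaning it would matter most for small models.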
