You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Regarding synthetic datasets: from the implementation and as it explained in the issue, train loss is evaluated on all tokens and test is only evaluated on the last token . Then my question is what is the advantage of such autoregressive training strategy, which require the model to be causal, rather than simply modelling the training as a classification problem, i.e. loss and accuracy of training is evaluated only on the last token as such that $p({y}[..., -1]) \simeq Hyena(x) [..., -1]$
If we follow this training approach then the target is estimated based on all the token in the sentence and, it seems that, it is not required for the model to be causal for datasets: Associative Recall and induction head, is it trues?