Non-causal implementation of language model for synthetic datasets

**Regarding synthetic datasets:** from the implementation, and as explained in [the issue](https://github.com/HazyResearch/safari/issues/35), _the training loss is evaluated on all tokens, while the test metric is evaluated only on the last token_. My question is: what is the advantage of this autoregressive training strategy, which requires the model to be causal, over simply treating training as a classification problem, i.e. evaluating the training loss and accuracy only on the last token, such that
$p(y[\dots, -1]) \simeq \mathrm{Hyena}(x)[\dots, -1]$
With this training approach, the target is estimated from all the tokens in the sequence, so it seems the model does not need to be causal for the _Associative Recall_ and _induction head_ datasets. Is this true?
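
To make the two loss strategies being compared concrete, here is a minimal, framework-free sketch. It is not taken from the safari code base: the function names `log_softmax` and `cross_entropy`, the `last_token_only` flag, and the toy logits and targets are all hypothetical and serve only as illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def cross_entropy(logits, targets, last_token_only=False):
    """Mean negative log-likelihood over the scored positions.

    logits:  list of length L, each entry a list of V unnormalized scores
    targets: list of length L holding integer class ids
    last_token_only: if True, score only the final position
                     (the classification-style loss from the question);
                     otherwise score every position (autoregressive-style).
    """
    positions = [len(targets) - 1] if last_token_only else range(len(targets))
    losses = [-log_softmax(logits[t])[targets[t]] for t in positions]
    return sum(losses) / len(losses)


# Toy sequence: 3 positions, vocabulary of size 2 (hypothetical values).
logits = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
targets = [0, 0, 1]

loss_all = cross_entropy(logits, targets)                         # all tokens
loss_last = cross_entropy(logits, targets, last_token_only=True)  # last token only
```

In the all-token variant, each position's target is typically the next input token, so a model that can attend to later positions could read its target directly from the input. In the last-token variant the single target is not part of the input, which is the property the question relies on.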

Non-causal implementation of language model for synthetic datasets #42

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Non-causal implementation of language model for synthetic datasets #42

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions