
Non-causal implementation of language model for synthetic datasets #42

@Karami-m

Description

Regarding the synthetic datasets: from the implementation, and as explained in the issue, the training loss is evaluated on all tokens while the test loss is evaluated only on the last token. My question is: what is the advantage of such an autoregressive training strategy, which requires the model to be causal, over simply modelling training as a classification problem, i.e. evaluating the training loss and accuracy only on the last token, so that
$p(y[\dots, -1]) \simeq \mathrm{Hyena}(x)[\dots, -1]$
With this training approach, the target is estimated from all the tokens in the sequence, and it seems the model would not need to be causal for the Associative Recall and induction head datasets. Is that true?
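
To make the two objectives concrete, here is a minimal PyTorch sketch of what I mean (not the repo's actual training code; `model`, `x`, and `y` are hypothetical placeholders, with `model` assumed to return per-position logits of shape `(batch, seq_len, vocab)`):

```python
import torch.nn.functional as F

def autoregressive_loss(model, x, y):
    # Repo-style training objective: cross-entropy over ALL positions.
    # Each position predicts its own next token, so the model must be
    # causal: position t must not see tokens after t.
    logits = model(x)  # (B, L, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))

def last_token_loss(model, x, y):
    # Classification-style objective proposed above: supervise only the
    # final position. The prediction may then depend on the whole
    # sequence, so a non-causal (bidirectional) model would also work.
    logits = model(x)  # (B, L, V)
    return F.cross_entropy(logits[:, -1], y[:, -1])
```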
