A long, long time ago, back in the days when n-gram modelling ruled, our LMs didn't carry context over from one sentence to the next. In the current era of LLMs we use many past sentences as context.
The supplied data in `data/wikitext-2` acknowledges this change and uses `'.'` as the token representing the full stop at the end of a sentence. However, the code in `data.py` is still n-gram style and explicitly appends the token `<eos>` to every line.
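For reference, the relevant part of `data.py` looks roughly like the sketch below. This is a simplified reconstruction, not the exact repository code: the class and file names follow the example repo, but details such as tensor handling are omitted.

```python
import os

class Dictionary:
    """Maps tokens to integer ids (simplified)."""
    def __init__(self):
        self.word2idx = {}
        self.idx2word = []

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word.append(word)
            self.word2idx[word] = len(self.idx2word) - 1
        return self.word2idx[word]

class Corpus:
    def __init__(self, path):
        self.dictionary = Dictionary()
        self.train = self.tokenize(os.path.join(path, 'train.txt'))
        self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
        self.test = self.tokenize(os.path.join(path, 'test.txt'))

    def tokenize(self, path):
        """Tokenizes a text file into a flat list of token ids."""
        ids = []
        with open(path, 'r', encoding='utf8') as f:
            for line in f:
                # The hard-wired end-of-line token this issue is about:
                words = line.split() + ['<eos>']
                for word in words:
                    ids.append(self.dictionary.add_word(word))
        return ids
```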
This is (extremely valuable!) example code, so it should be as clean and general as possible. The use of `<eos>` adds nothing and should be removed. Indeed, it can be seen as a bug: it adds an artificial token that is easily predictable, so it artificially reduces perplexity. Example code should be general and should not carry hard-wired variables left over from a legacy implementation. If people want an `<eos>` token, it should be in the data, as in wikitext-2; if they don't want one, it shouldn't be enforced.
It would be very easy to remove the occurrences of `+ ['<eos>']` in `data.py`, and the resulting example code would be more general and more scientific.
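Concretely, the proposed change is one line inside the `tokenize` loop sketched above (shown here in the same simplified form):

```python
# Before: an artificial, trivially predictable token is appended to every line.
words = line.split() + ['<eos>']

# After: the tokens come entirely from the data. A corpus that wants an
# end-of-sentence marker simply includes one, as data/wikitext-2 does with '.'.
words = line.split()
```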