You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
According to the README file, for summarization (cnndm) task the following truncation setup is recommended: -src_seq_length_trunc 400
However, on the training data, the average/median length of the source is 925/841, more than 90% of the data is longer than 400 BPE tokens, would it be problematic to throw away the rest of the text? Or is this simply for efficiency consideration? Thanks!