source truncation size in summarization task

Hi,

According to the README file, the following truncation setup is recommended for the summarization (cnndm) task:
`-src_seq_length_trunc 400`

However, on the training data, the average/median source length is 925/841 BPE tokens, and more than 90% of the examples are longer than 400 BPE tokens. Would it be problematic to throw away the rest of the text, or is this simply an efficiency consideration? Thanks!
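
For reference, here is a minimal sketch of how length statistics like the ones above can be computed. It assumes the source file holds one BPE-tokenized example per line with tokens separated by whitespace; the function name `truncation_stats` and its parameters are made up for illustration and are not part of the toolkit.

```python
from statistics import mean, median


def truncation_stats(token_lines, max_len=400):
    """Summarise source lengths and the effect of truncating at max_len.

    token_lines: iterable of strings, one whitespace-tokenized example each
                 (assumption: tokens are already BPE-segmented).
    max_len:     truncation limit, e.g. the value given to -src_seq_length_trunc.
    """
    lengths = [len(line.split()) for line in token_lines]
    total = sum(lengths)
    if not lengths or total == 0:
        raise ValueError("no tokens found in the given lines")

    # Tokens that survive when every example is cut to at most max_len tokens.
    kept = sum(min(n, max_len) for n in lengths)

    return {
        "mean": mean(lengths),
        "median": median(lengths),
        # Share of examples that lose at least one token.
        "frac_longer": sum(n > max_len for n in lengths) / len(lengths),
        # Share of all source tokens that would be discarded.
        "frac_tokens_dropped": 1 - kept / total,
    }
```

Calling it on the lines of the training source file (for example an open file handle) with `max_len=400` gives both the share of examples affected and the share of tokens actually discarded, which is the more direct measure of how much text the truncation throws away.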
