Skip to content

source truncation size in summarization task #3

@XinyuHua

Description

@XinyuHua

Hi,

According to the README file, for summarization (cnndm) task the following truncation setup is recommended:
-src_seq_length_trunc 400

However, on the training data, the average/median length of the source is 925/841, more than 90% of the data is longer than 400 BPE tokens, would it be problematic to throw away the rest of the text? Or is this simply for efficiency consideration? Thanks!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions