stm handling contractions #293

@val-pf

It seems that the stopword list does not handle contractions such as "we've" and "they're" well, even though these are common in spoken language. Is there a recommended way to preprocess a corpus to detect and replace contractions, or an option to remove them specifically?
They show up in my topic FREX words as "weve" or "theyr", so perhaps the order of punctuation removal and stemming matters, too.
In the long term, it would be great to have a spoken-language option for prepDocuments() that handles these cases (and others).
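
To illustrate the kind of preprocessing asked about above, here is a minimal sketch in base R that expands contractions before any punctuation removal or stemming takes place. It is not an stm feature: the `contractions` lookup table and the `expand_contractions()` helper are hypothetical names introduced only for this example, and the table is deliberately tiny and would need extending for a real corpus.

```r
# Hypothetical lookup table (not part of stm); extend it for your own corpus.
contractions <- c(
  "we've"   = "we have",
  "they're" = "they are",
  "don't"   = "do not",
  "can't"   = "cannot",
  "i'm"     = "i am"
)

# Hypothetical helper: expand contractions while the apostrophes are still
# present, i.e. before punctuation removal and stemming.
expand_contractions <- function(x, map = contractions) {
  # Normalise curly apostrophes, which are common in transcripts.
  x <- gsub("\u2019", "'", x, fixed = TRUE)
  # Lowercase so that "We've" and "we've" hit the same table entry.
  x <- tolower(x)
  for (pattern in names(map)) {
    # Word boundaries keep the match from firing inside longer tokens.
    x <- gsub(paste0("\\b", pattern, "\\b"), map[[pattern]], x, perl = TRUE)
  }
  x
}

docs <- c("We've seen this before.", "They're common in speech.")
expanded <- expand_contractions(docs)
print(expanded)
```

The expanded text could then go through the usual stm preprocessing (for example textProcessor() followed by prepDocuments()), where words such as "we", "have", "they" and "are" would be matched by an ordinary English stopword list instead of surviving as "weve" or "theyr".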
