It seems that the stopword list does not handle contractions such as "we've" and "they're" well. These are common in spoken language. Is there a recommended way to preprocess a corpus to detect and replace contractions, or an option to remove them specifically?
They show up in my topics' FREX words as "weve" or "theyr", so perhaps the order of punctuation removal, stopword removal, and stemming matters too.
Long term, it would be great to have a spoken-language option for prepDocuments() that can handle these cases (and others).
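For reference, here is a rough sketch of the kind of preprocessing I have in mind: expanding contractions before any punctuation removal, so that the resulting words ("have", "are", "not") can still match an ordinary stopword list. It is written in Python only to keep it self-contained. The mappings and the function name `expand_contractions` are my own placeholders, not part of stm, and the lists are deliberately incomplete. The same substitutions could presumably be done with `gsub()` in R before the text is processed.

```python
import re

# Hypothetical, incomplete mappings for illustration only.
# Irregular forms that cannot be handled by simple suffix replacement.
IRREGULAR = {"can't": "cannot", "won't": "will not"}

# Regular suffixes. "'d" and "'s" are left out on purpose because they
# are ambiguous ("would"/"had", "is"/"has"/possessive).
SUFFIXES = {
    "n't": " not",
    "'ve": " have",
    "'re": " are",
    "'ll": " will",
    "'m": " am",
}

_IRREGULAR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, IRREGULAR)) + r")\b", re.IGNORECASE
)
# The suffix must directly follow a letter and end at a word boundary.
_SUFFIX_RE = re.compile(
    r"(?<=[A-Za-z])(" + "|".join(map(re.escape, SUFFIXES)) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Expand common English contractions in a string.

    Irregular forms are replaced with their lowercase expansion, which is
    harmless if the text is lowercased later in the pipeline anyway.
    """
    # Normalize typographic apostrophes so both styles are matched.
    text = text.replace("\u2019", "'")
    text = _IRREGULAR_RE.sub(lambda m: IRREGULAR[m.group(1).lower()], text)
    return _SUFFIX_RE.sub(lambda m: SUFFIXES[m.group(1).lower()], text)


if __name__ == "__main__":
    print(expand_contractions("We've heard they're here, but I can't stay."))
```

With something like this applied first, "we've" becomes "we have" instead of surviving as "weve" after punctuation is stripped.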