Filtering out colloquial and conversational expressions (phrases) that make no sense to the NLP process #10311
Hey @gremur, thanks for the discussion. I think it makes sense to identify these phrases and remove them for your use case. If you have a list of these phrases, I'd recommend writing a custom component that identifies these spans, and then you can do whatever you need to with them.

More broadly, I don't think they should always be removed. What if someone were doing a research project and wanted to understand how common these phrases are in conversations? These phrases do function similarly to stop words, in that they are often more functional than content-bearing, but I think a key difference is that they behave more like open-class words: new phrases like this are easily coined and used in everyday text. Such a list would be difficult to maintain and update as language evolves.

Finally, if you want some way of identifying these phrases (often called multi-word expressions) given a lot of text, I've used gensim's Phraser for this type of thing.
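To make the list-based approach concrete, here is a minimal, library-agnostic sketch that uses only Python's standard library. It is not spaCy code: in spaCy the same matching would typically live inside a custom pipeline component, for example one built on `PhraseMatcher`. The phrase list and the names `build_pattern`, `find_stop_phrases`, and `remove_stop_phrases` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical phrase list; in practice you would supply and maintain your own.
STOP_PHRASES = ["to be honest", "by the way", "frankly speaking", "for a while"]


def build_pattern(phrases, trailing_comma=False):
    """Compile one case-insensitive regex that matches any listed phrase."""
    # Try longer phrases first so they win over shorter overlapping ones.
    ordered = sorted(phrases, key=len, reverse=True)
    # Allow any run of whitespace between the words of a phrase.
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in ordered
    )
    # Optionally swallow a comma that directly follows the phrase.
    suffix = ",?" if trailing_comma else ""
    return re.compile(rf"\b(?:{alternatives})\b{suffix}", re.IGNORECASE)


def find_stop_phrases(text, phrases=STOP_PHRASES):
    """Return (start, end, matched_text) for every phrase occurrence."""
    pattern = build_pattern(phrases)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def remove_stop_phrases(text, phrases=STOP_PHRASES):
    """Drop every phrase occurrence and normalise the leftover whitespace."""
    pattern = build_pattern(phrases, trailing_comma=True)
    return " ".join(pattern.sub(" ", text).split())


if __name__ == "__main__":
    sample = "To be honest, the movie was fine. By the way, it ran long."
    print(find_stop_phrases(sample))
    print(remove_stop_phrases(sample))
```

Keeping the "find" step separate from the "remove" step means the same spans can be counted or analysed instead of deleted, which covers the research use case as well.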
Do you think it makes practical sense to implement the removal of "stop phrases" that are conversational "white noise" and do not affect the meaning of the text? I mean phrases like "for a while", "difficult to believe", "to be honest", "at present day", "frankly speaking", "by the way", etc.
Arguments for this suggestion: