Filtering out colloquial and conversational expressions (phrases) that make no sense to the NLP process #10311

gremur · 2022-02-16T13:45:44Z

Do you think it makes practical sense to implement the removal of "stop phrases", i.e. conversational "white noise" that does not affect the meaning of the text? I mean phrases like "for a while", "difficult to believe", "to be honest", "at present day", "frankly speaking", "by the way", etc.

Arguments for this suggestion:

Such phrases can be completely removed (filtered) because they do not affect the meaning of the text.
Such phrases cannot be completely removed by stop-word removal alone, because they may contain regular words rather than just stop words.
Removing only stop words will leave "chunks" of these phrases behind, and this may have a negative impact on the formation of n-grams.
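
To make the last two arguments concrete, here is a minimal sketch in plain Python. The stop-word and stop-phrase lists are tiny, made-up examples, and the helper names (`remove_stop_words`, `remove_stop_phrases`, `bigrams`) are hypothetical, not part of any library.

```python
import re

# Illustrative lists only; real stop-word and stop-phrase lists would be much larger.
STOP_WORDS = {"to", "be", "the", "by", "is", "a", "for"}
STOP_PHRASES = ["to be honest", "by the way", "frankly speaking", "for a while"]


def remove_stop_words(tokens):
    """Drop single tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def remove_stop_phrases(text):
    """Replace every whole-phrase match with a space, ignoring case."""
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in STOP_PHRASES) + r")\b"
    return re.sub(pattern, " ", text, flags=re.IGNORECASE)


def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return list(zip(tokens, tokens[1:]))


text = "to be honest the camera is great by the way"

# Stop-word removal alone leaves the leftovers "honest" and "way" behind.
words_only = remove_stop_words(text.split())

# Removing the phrases first leaves only the content words.
phrases_first = remove_stop_words(remove_stop_phrases(text).split())

print(bigrams(words_only))
print(bigrams(phrases_first))
```

With stop-word removal alone, the leftover "honest" and "way" end up in bigrams next to content words. Removing the phrases first avoids that.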

pmbaumgartner · 2022-02-16T14:31:05Z

Hey @gremur, thanks for the discussion. I think it makes sense to identify these phrases and remove them for your use case. If you have a list of these phrases, I'd recommend writing a custom component that identifies these spans and then you can do whatever you need to with them.
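
A rough sketch of what such a component could look like, assuming spaCy v3 and its `PhraseMatcher`. The phrase list, the component name `filler_phrases`, the span key `"filler"` and the helper `strip_fillers` are made up for illustration; a blank English pipeline is used so no trained model is required.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Hypothetical phrase list; in practice you would load your own.
FILLER_PHRASES = ["to be honest", "frankly speaking", "by the way", "for a while"]

nlp = spacy.blank("en")

# Match on the lowercase form so "Frankly speaking" and "frankly speaking" both hit.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("FILLER", [nlp.make_doc(phrase) for phrase in FILLER_PHRASES])


@Language.component("filler_phrases")
def filler_phrases(doc):
    """Store non-overlapping filler-phrase matches in doc.spans["filler"]."""
    doc.spans["filler"] = filter_spans(matcher(doc, as_spans=True))
    return doc


nlp.add_pipe("filler_phrases")


def strip_fillers(doc):
    """Rebuild the text without the tokens covered by a filler span."""
    drop = {token.i for span in doc.spans["filler"] for token in span}
    return "".join(t.text_with_ws for t in doc if t.i not in drop).strip()


doc = nlp("Frankly speaking the battery lasts long")
print([span.text for span in doc.spans["filler"]])
print(strip_fillers(doc))
```

Keeping the matches as spans rather than deleting them inside the component means the same pipeline can serve both uses: filtering them out, or counting them.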

More broadly, I don't think it's always the case that they should be removed. What if someone was doing a research project and wanted to understand how common these phrases were in conversations? These types of phrases do function similarly to stop words in that many times they're more functional than content phrases, but I think a key difference is that they operate more like open-class words—that is, new phrases like this are easily generated and used in everyday text. Such a list of phrases would be difficult to maintain and update as language evolves.

Finally, if you want some way of identifying these phrases (often called multi-word expressions) given a lot of text, I've used gensim's Phraser for this type of thing.

1 reply

gremur Feb 16, 2022
Author

Thank you for the detailed and clear answer. I agree that there are situations where such phrases can be useful.
