Thank you for the amazing library. I'm using lingua-py 2.0.2 to classify the language of crawled webpages. The text is simply extracted with BeautifulSoup using soup.get_text(' ').
I noticed an abnormally high number of Yoruba detections (and maybe some other minor languages I haven't noticed yet). After checking some results, it looks like this happens because the webpage contains text in many languages, usually from language selection dropdowns.
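A possible workaround on the extraction side, as a rough sketch only: drop the text inside dropdowns before running detection. To keep the example self-contained it uses the standard library's html.parser instead of BeautifulSoup, and the names `SKIPPED_TAGS`, `VisibleTextExtractor`, and `extract_text_without_dropdowns` are made up for illustration. It assumes the language selector is a real `<select>` element; menus built from `<ul>` or `<div>` would need site-specific rules.

```python
from html.parser import HTMLParser

# Tags whose text content is ignored. <select> covers classic language
# dropdowns; this set is an assumption, not something lingua-py prescribes.
SKIPPED_TAGS = {"select", "script", "style"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self):
        return " ".join(self._chunks)


def extract_text_without_dropdowns(html):
    """Return the page text with <select>, <script>, and <style> content removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This only reduces the noise going into the detector; it does not change how lingua-py scores mixed-language input.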
Using LanguageDetectorBuilder.from_all_languages(), it recognizes the text as Yoruba with 1.0 confidence.
As a human, I think there are still enough features (the text at the beginning and end) to recognize the webpage as "mainly English".
It's reasonable that lingua-py does not recognize the mixed-language text as English, but recognizing it as Yoruba with 100% confidence doesn't look correct at all. I'd like lingua-py to at least output a low confidence so I can filter out the problematic text.
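In the meantime, something like the following might serve as a filter on my side. It is a minimal sketch, not lingua-py functionality: the text is split into lines, each sufficiently long line is classified separately, and the result is the language with the largest share of characters, with that share acting as a rough confidence. The function `dominant_language` and its `min_chunk_chars` parameter are hypothetical, and the sketch assumes the text was extracted with a newline separator (for example soup.get_text('\n')) so that menu entries end up on their own lines.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Optional, Tuple


def dominant_language(
    text: str,
    detect: Callable[[str], Optional[Hashable]],
    min_chunk_chars: int = 20,
) -> Tuple[Optional[Hashable], float]:
    """Return (language, share).

    share is the fraction of counted characters that sit in lines
    detected as the winning language. Lines shorter than
    min_chunk_chars are skipped entirely.
    """
    weights: Dict[Hashable, int] = defaultdict(int)
    for line in text.splitlines():
        chunk = line.strip()
        if len(chunk) < min_chunk_chars:
            continue  # short fragments such as menu entries are unreliable
        language = detect(chunk)
        if language is not None:
            weights[language] += len(chunk)

    total = sum(weights.values())
    if total == 0:
        return None, 0.0
    best = max(weights, key=lambda lang: weights[lang])
    return best, weights[best] / total
```

The `detect` argument can be any callable that returns a language or `None`; a detector's `detect_language_of` method should fit that shape, though I have not verified this exact combination against 2.0.2. A low share could then be used to filter out pages, which is the signal I was hoping to get from the confidence value directly.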
This is an example input: lingua-test.txt