One-liners for finding whether something is English or not #184
-
Right now, a dataset of famous English quotes contains a lot of non-English quotes (including Hindi), so it needs to be checked. Is there a one-liner or something similar that checks English against all other common languages?
Replies: 1 comment 1 reply
-
Hi @TomLucidor, I'm afraid I don't completely understand what you want to do. Do you want to identify all non-English parts in your text? If you are able to tokenize your text into separate parts so that the non-English parts can be treated in isolation, then Lingua should be able to identify them as non-English. Please correct me if I'm mistaken about what you want to do.
Sorry, I found my solution for language classification on a quote dataset. I realized that the "builders" look intimidating, but there should be defaults for spoken languages. In most cases such quotes are monolingual, and I assume so even though some can be bilingual (e.g. French with an English translation).
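
For anyone landing here with the same question, here is a minimal sketch of what whole-quote classification could look like, assuming the Python package `lingua-language-detector` and that each quote is monolingual. The helper names `make_lingua_detect` and `split_by_english` are hypothetical and not part of Lingua. The detector is passed in as a plain callable so the filtering logic can be tried without loading Lingua's models.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A detector is any callable mapping text to a language name, or None if undecided.
Detect = Callable[[str], Optional[str]]


def make_lingua_detect() -> Detect:
    """Hypothetical helper: wrap a Lingua detector built from its spoken-language defaults.

    Requires `pip install lingua-language-detector`. Imported lazily so the
    rest of this sketch runs even when Lingua is not installed.
    """
    from lingua import LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_all_spoken_languages().build()

    def detect(text: str) -> Optional[str]:
        # detect_language_of returns None when no language can be determined.
        language = detector.detect_language_of(text)
        return language.name if language is not None else None

    return detect


def split_by_english(
    quotes: Iterable[str], detect: Detect
) -> Tuple[List[str], List[Tuple[str, Optional[str]]]]:
    """Hypothetical helper: separate quotes labeled ENGLISH from everything else.

    Assumes each quote is monolingual, so one label per quote is enough.
    """
    english: List[str] = []
    other: List[Tuple[str, Optional[str]]] = []
    for quote in quotes:
        label = detect(quote)
        if label == "ENGLISH":
            english.append(quote)
        else:
            # Keep the detected label so rejected quotes can be reviewed by hand.
            other.append((quote, label))
    return english, other
```

With Lingua installed, usage would be roughly `english, other = split_by_english(quotes, make_lingua_detect())`. Short quotes are the hardest case for any detector, so it is probably worth reviewing the `other` list manually rather than deleting it blindly.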