Proposition: Using prior language probability to increase likelihood #101
Comments
Hi @slavaGanzin, thank you for this very interesting idea. :) I will evaluate whether the overall accuracy improves when applying prior probabilities.
I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) being predicted over much more likely ones.
Another related suggestion: allow us to pass in a dictionary of expected languages and their probabilities. Let's say we're using social media data and we know (or have concluded) the primary language for each user. It would be useful to be able to tell lingua (perhaps even with some sort of probability, calculated from the language breakdown of the user's prior posts) what the expected language might be. E.g. I post in English 99% of the time, but sometimes I write in Spanish. So, in an ambiguous situation, it would be better to conclude that it is English. But if I had other contextual metadata available (e.g. knowing that the post is from a Spanish-centric group/page/hashtag, etc.), the pre-provided probability could be different. If no argument is passed in, it could use some sort of global default, perhaps the one suggested by the OP, which we could override for our own domains with a custom dictionary.
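To make this concrete, here is a minimal sketch of what such a per-user prior could look like as a wrapper around the existing public API. The function name `detect_with_user_prior` and the prior values are hypothetical, not lingua-py API; only `LanguageDetectorBuilder` and `compute_language_confidence_values` are existing calls, and the shape of the returned confidence values may differ between library versions.

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.SPANISH
).build()

# Hypothetical per-user prior, e.g. derived from the language breakdown
# of this user's previous posts: 99% English, 1% Spanish.
user_priors = {Language.ENGLISH: 0.99, Language.SPANISH: 0.01}


def detect_with_user_prior(text, priors, default_prior=0.01):
    """Reweight lingua's confidence values by a caller-supplied prior."""
    scores = {}
    for confidence in detector.compute_language_confidence_values(text):
        prior = priors.get(confidence.language, default_prior)
        scores[confidence.language] = confidence.value * prior
    # Pick the language with the highest prior-weighted score.
    return max(scores, key=scores.get)


# An ambiguous short text now leans towards the user's dominant language.
print(detect_with_user_prior("ok amigo", user_priors))
```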
@nickchomey
Hi duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?
@bhaveshkr I've just written down some performance tips in the README. You probably want to read them.
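For anyone landing here with the same problem, those tips boil down to something like the following (a hedged sketch; the builder methods shown are as documented for recent lingua-py versions, so verify them against your installed release). The most important point is to build the detector once and reuse it across calls rather than rebuilding it per string.

```python
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to the languages you actually expect instead of all of them.
languages = [Language.ENGLISH, Language.SPANISH, Language.RUSSIAN]

detector = (
    LanguageDetectorBuilder.from_languages(*languages)
    # Load the language models eagerly at startup instead of lazily per call.
    .with_preloaded_language_models()
    # Optional: trade some accuracy (especially on short texts) for speed
    # and lower memory use.
    .with_low_accuracy_mode()
    .build()
)

print(detector.detect_language_of("languages are awesome"))
```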
@pemistahl It's great to see a new version! I was getting a bit worried. Without putting undue pressure on you, do you think you are likely to consider the idea in this issue, or something similar, any time soon?
I just did exactly what the README told me, but our use case is typically short-ish strings. We run it on AWS Lambda, where it works fine with an increased timeout.
@duboff Half a year ago or so, I did a quick evaluation of applying hard-coded prior probabilities. But the overall detection accuracy decreased significantly. So the proposed approach in this issue is not as promising as you may expect. I've kept this issue open so far because I think it's worth doing more experiments in this direction. Not having enough free time is the limiting factor. This is an open source project, however, so feel free to fork and implement improvements yourself. I'm always happy about pull requests.
I'm just going to reiterate that I think the approach I suggested is clearly the right one: allow us to pass in our own probabilities rather than have them hardcoded.
@pemistahl Peter, I think it would be beneficial for this library to have a separate method that adds a prior probability (in a Bayesian way) to the mix.
Let's look at the statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
If 57% of the texts you see on the internet are in English, then predicting "English" for any input would make you wrong only 43% of the time. It's like a stopped clock, except this one is right on every second probe.
For example: #100
Based on that premise: using plain character statistics alone, "как дела" ("how are you") looks more Macedonian than Russian. But overall, if we add language usage statistics to the mix, lingua-py would be "wrong" less often.
There are more Russian-speaking users of this library than Macedonian-speaking ones, simply because there are more Russian speakers overall. So when a random user writes "как дела", it is "more accurate" to predict Russian than Macedonian, just because that is what these users generally expect.
So my proposition is to add a
detector.detect_language_with_prior
function and factor in the prior: likelihood = probability × prior_probability. For example: #97
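A minimal sketch of what the proposed factorization could look like, implemented outside the library on top of the public API. The function name detect_language_with_prior and the prior figures are illustrative (loosely based on the internet-usage statistics linked above), not part of lingua-py; the confidence values come from the existing compute_language_confidence_values call, whose return type may differ between versions.

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.RUSSIAN, Language.MACEDONIAN
).build()

# Illustrative global priors, roughly following the internet-usage statistics.
PRIORS = {
    Language.ENGLISH: 0.57,
    Language.RUSSIAN: 0.05,
    Language.MACEDONIAN: 0.001,
}


def detect_language_with_prior(text, priors):
    """likelihood = probability * prior_probability, normalized over languages."""
    weighted = {
        c.language: c.value * priors.get(c.language, 0.0)
        for c in detector.compute_language_confidence_values(text)
    }
    total = sum(weighted.values()) or 1.0
    posteriors = {lang: score / total for lang, score in weighted.items()}
    # Return the most probable language under the prior-weighted score.
    return max(posteriors, key=posteriors.get)


# The much larger Russian prior is meant to tilt the decision towards Russian
# for ambiguous Cyrillic input like "как дела".
print(detect_language_with_prior("как дела", PRIORS))
```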