Description
@pemistahl Peter, I think it would be beneficial for this library to have a separate method that incorporates a prior probability (in a Bayesian way) into the mix.
Let's look into statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
According to those statistics, roughly 57% of the texts on the internet are in English. So if you predicted "English" for any input, you would be wrong only 43% of the time. It's like a stopped clock, except it is right on more than every second probe.
For example: #100
Based on that premise, if we use just plain character statistics, "как дела" looks more Macedonian than Russian. But overall, if we add language statistics to the mix, lingua-py would be "wrong" less often.
There are more Russian-speaking users of this library than Macedonian ones, simply because there are more Russian speakers overall. So when a random user writes "как дела", it is "more accurate" to predict "russian" than "macedonian", because in general that is what those users expect.
So my proposal is to add a detector.detect_language_with_prior
function and factor in the prior: likelihood = probability × prior_probability
For example: #97
detector.detect_language_of("Hello")
"ITALIAN": 0.9900000000000001,
"SPANISH": 0.8457074930316446,
"ENGLISH": 0.6405700388041755,
"FRENCH": 0.260556921899765,
"GERMAN": 0.01,
"CHINESE": 0,
"RUSSIAN": 0
detector.detect_language_with_prior("Hello")
# Of course, the constants are for illustrative purposes only.
# Results should be normalized afterwards.
"ENGLISH": 0.6405700388041755 * 0.577,
"SPANISH": 0.8457074930316446 * 0.045,
"ITALIAN": 0.9900000000000001 * 0.017,
"FRENCH": 0.260556921899765 * 0.039,
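The proposed weighting could be sketched roughly as follows. This is a minimal illustration, not lingua's actual API: the `detect_language_with_prior` function, the `PRIORS` table, and the final normalization step are all assumptions; the prior figures are the web-content shares from the Wikipedia page linked above, and the confidence values are the ones from the example.

```python
# Hypothetical priors: approximate share of web content per language,
# taken from https://en.wikipedia.org/wiki/Languages_used_on_the_Internet.
PRIORS = {
    "ENGLISH": 0.577,
    "SPANISH": 0.045,
    "FRENCH": 0.039,
    "ITALIAN": 0.017,
}

def detect_language_with_prior(confidences, priors=PRIORS):
    """Weight per-language confidence values by prior probabilities
    (likelihood = confidence * prior) and renormalize to sum to 1.

    This is an illustrative sketch, not part of the lingua API.
    """
    weighted = {
        lang: conf * priors.get(lang, 0.0)
        for lang, conf in confidences.items()
    }
    total = sum(weighted.values())
    if total == 0:
        return weighted  # no prior mass at all; return unnormalized zeros
    return {lang: w / total for lang, w in weighted.items()}

# Confidence values as in the "Hello" example above:
confidences = {
    "ITALIAN": 0.9900000000000001,
    "SPANISH": 0.8457074930316446,
    "ENGLISH": 0.6405700388041755,
    "FRENCH": 0.260556921899765,
}
result = detect_language_with_prior(confidences)
# ENGLISH now ranks first, because its much larger prior outweighs
# the slightly higher raw confidences of Italian and Spanish.
```

Languages with no entry in the prior table get a prior of 0 here; in practice one would probably want some small smoothing value instead, so that rare languages are down-weighted rather than excluded entirely.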
Linked issues: