Proposition: Using prior language probability to increase likelihood #101
Comments
Hi @slavaGanzin, thank you for this very interesting idea. :) I will evaluate whether the overall accuracy improves when applying prior probabilities.
I agree this should dramatically increase quality. After using lingua-py in production at scale, we've noticed quite a few instances of small languages (e.g. Bulgarian, Macedonian) being predicted over much more likely ones.
Another related suggestion: allow us to pass in a dictionary of expected languages and their probabilities. Let's say we're using social media data and we know (or have concluded) the primary language for each user. It would be useful to be able to tell lingua (perhaps even with some sort of probability, calculated from the language breakdown of the user's prior posts) what the expected language might be. E.g. I post in English 99% of the time, but sometimes I write in Spanish. So, in an ambiguous situation, it would be better to conclude that it is English. But if I had other contextual metadata available (e.g. knowing that the post is from a Spanish-centric group/page/hashtag, etc.), the pre-provided probability could be different. If no argument is passed in, it could use some sort of global default, perhaps the one suggested by the OP, which we could override for our own domains with a custom dictionary.
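To make this concrete, here is a minimal sketch of what such a per-user prior could look like as a wrapper around the existing public API. The function name `detect_with_user_prior` and the prior values are hypothetical, not lingua-py API; only `LanguageDetectorBuilder` and `compute_language_confidence_values` are existing calls, and the shape of the returned confidence values may differ between library versions.

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.SPANISH
).build()

# Hypothetical per-user prior, e.g. derived from the language breakdown
# of this user's previous posts: 99% English, 1% Spanish.
user_priors = {Language.ENGLISH: 0.99, Language.SPANISH: 0.01}


def detect_with_user_prior(text, priors, default_prior=0.01):
    """Reweight lingua's confidence values by a caller-supplied prior."""
    scores = {}
    for confidence in detector.compute_language_confidence_values(text):
        prior = priors.get(confidence.language, default_prior)
        scores[confidence.language] = confidence.value * prior
    # Pick the language with the highest prior-weighted score.
    return max(scores, key=scores.get)


# An ambiguous short text now leans towards the user's dominant language.
print(detect_with_user_prior("ok amigo", user_priors))
```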
@nickchomey
Hi duboff, I find lingua to be extremely slow, around 10-20 strings/sec on a MacBook Pro. Can you suggest an approach to make it usable in a production environment?
@bhaveshkr I've just written down some performance tips in the README. You probably want to read them.
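For anyone landing here with the same problem, those tips boil down to something like the following (a hedged sketch; the builder methods shown are as documented for recent lingua-py versions, so verify them against your installed release). The most important point is to build the detector once and reuse it across calls rather than rebuilding it per string.

```python
from lingua import Language, LanguageDetectorBuilder

# Restrict detection to the languages you actually expect instead of all of them.
languages = [Language.ENGLISH, Language.SPANISH, Language.RUSSIAN]

detector = (
    LanguageDetectorBuilder.from_languages(*languages)
    # Load the language models eagerly at startup instead of lazily per call.
    .with_preloaded_language_models()
    # Optional: trade some accuracy (especially on short texts) for speed
    # and lower memory use.
    .with_low_accuracy_mode()
    .build()
)

print(detector.detect_language_of("languages are awesome"))
```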
@pemistahl It's great to see a new version! I was getting a bit worried. Without putting undue pressure on you, do you think you are likely to consider the idea in this issue, or something similar, any time soon?
I just did exactly what the README told me, but our use case is typically short-ish strings. We run it on AWS Lambda, where it works fine with an increased timeout.
@duboff Half a year ago or so, I did a quick evaluation of applying hard-coded prior probabilities. But the overall detection accuracy decreased significantly. So the proposed approach in this issue is not as promising as you may expect. I've kept this issue open so far because I think it's worth doing more experiments in this direction. Not having enough free time is the limiting factor. This is an open source project, however, so feel free to fork and implement improvements yourself. I'm always happy about pull requests.
I'm just going to reiterate that I think the approach I suggested is clearly the right one: allow us to pass in our own probabilities rather than have them hardcoded.
@pemistahl Peter, I think it would be beneficial for this library to have a separate method that adds a prior probability (in a Bayesian way) to the mix.
Let's look at the statistics: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
If 57% of the texts you see on the internet are in English, then predicting "English" for any input would make you wrong only 43% of the time. It's like a stopped clock, except this one is right on every second probe.
For example: #100
Based on that premise: using plain character statistics alone, "как дела" ("how are you") looks more Macedonian than Russian. But overall, if we add language usage statistics to the mix, lingua-py would be "wrong" less often.
There are more Russian-speaking users of this library than Macedonian-speaking ones, simply because there are more Russian speakers overall. So when a random user writes "как дела", it is "more accurate" to predict Russian than Macedonian, just because that is what these users generally expect.
So my proposition is to add a
detector.detect_language_with_prior
function and factor in the prior: likelihood = probability × prior_probability. For example: #97
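A minimal sketch of what the proposed factorization could look like, implemented outside the library on top of the public API. The function name detect_language_with_prior and the prior figures are illustrative (loosely based on the internet-usage statistics linked above), not part of lingua-py; the confidence values come from the existing compute_language_confidence_values call, whose return type may differ between versions.

```python
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.RUSSIAN, Language.MACEDONIAN
).build()

# Illustrative global priors, roughly following the internet-usage statistics.
PRIORS = {
    Language.ENGLISH: 0.57,
    Language.RUSSIAN: 0.05,
    Language.MACEDONIAN: 0.001,
}


def detect_language_with_prior(text, priors):
    """likelihood = probability * prior_probability, normalized over languages."""
    weighted = {
        c.language: c.value * priors.get(c.language, 0.0)
        for c in detector.compute_language_confidence_values(text)
    }
    total = sum(weighted.values()) or 1.0
    posteriors = {lang: score / total for lang, score in weighted.items()}
    # Return the most probable language under the prior-weighted score.
    return max(posteriors, key=posteriors.get)


# The much larger Russian prior is meant to tilt the decision towards Russian
# for ambiguous Cyrillic input like "как дела".
print(detect_language_with_prior("как дела", PRIORS))
```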