Strange matching for Spanish phrase detected as Finnish #11
Hey! I've been messing with this library, and most of it seems great! There is one issue I've run into: a Spanish phrase is detected as Finnish, which gets the highest confidence value.

Phrase: ¿les gustan los pokemon?

With the following code:

    package main

    import (
        "log"

        "github.com/pemistahl/lingua-go"
    )

    func main() {
        // Build a detector covering all spoken languages, with models loaded up front.
        detector := lingua.
            NewLanguageDetectorBuilder().
            FromAllSpokenLanguages().
            WithPreloadedLanguageModels().
            Build()

        content := "¿les gustan los pokemon?"

        // Most likely language, plus whether the detection is considered reliable.
        lang, reliable := detector.DetectLanguageOf(content)
        log.Println(lang.String(), reliable)
        log.Println(" --- ")

        // Confidence values for every candidate language.
        confidences := detector.ComputeLanguageConfidenceValues(content)
        for _, langConf := range confidences {
            log.Println(langConf.Language().String(), langConf.Value())
        }
    }

The following output is produced:
I'm not sure why it ranked Spanish 4th. Is there a good method to get around this? Unfortunately, given my use case, I need to detect from a wide range of languages like this. This library is awesome overall. I'm using the latest stable release. Thank you for this!
Replies: 1 comment 1 reply
Hello Aiden, thanks for trying my library and for your question.
Well, it seems that for this specific sentence, the sum of the ngram probabilities for Finnish is greater than the one for Spanish. This is not a bug, this is just mathematics. The word "pokemon" is certainly crucial here. It is a proper noun, so it's neither Finnish nor Spanish. At best, it's Japanese. It contains ngrams that are not characteristic of Spanish, so it confuses the algorithm, returning Finnish. If you remove this word from your sentence, the detector returns Spanish as the most likely language.
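If removing such words by hand is not practical, one possible workaround is to strip known proper nouns from the text before passing it to the detector. The following is only a minimal sketch under stated assumptions: `stripWords` and the `ignore` list are hypothetical helpers, not part of lingua-go, and you would have to maintain the list of words yourself.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripWords removes every whitespace-separated word of text that appears
// in ignore. Words are compared in lower case, with leading and trailing
// punctuation (such as "¿" or "?") trimmed before the lookup.
// Hypothetical helper for illustration; it is not part of lingua-go.
func stripWords(text string, ignore map[string]bool) string {
	kept := make([]string, 0)
	for _, word := range strings.Fields(text) {
		core := strings.TrimFunc(word, func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsDigit(r)
		})
		if ignore[strings.ToLower(core)] {
			continue // drop the listed proper noun
		}
		kept = append(kept, word)
	}
	return strings.Join(kept, " ")
}

func main() {
	// Assumed, hand-maintained list of proper nouns to ignore.
	ignore := map[string]bool{"pokemon": true}

	cleaned := stripWords("¿les gustan los pokemon?", ignore)
	fmt.Println(cleaned) // prints "¿les gustan los"

	// cleaned can then be passed to detector.DetectLanguageOf(cleaned).
}
```

This only helps for words you already know cause trouble. Very short inputs remain hard to classify, so it is probably still worth checking the confidence values before trusting the result.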