Strange matching for Spanish phrase detected as Finnish #11

aidenwallis · 2022-04-19T23:25:30Z

Hey! I've been messing with this library, and most of it seems great! There is one issue I've run into: a Spanish phrase is detected as Finnish with a confidence level of 1, and I'm unsure if this is intended.

Phrase: ¿les gustan los pokemon?

With the following code:

package main

import (
	"log"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.
		NewLanguageDetectorBuilder().
		FromAllSpokenLanguages().
		WithPreloadedLanguageModels().
		Build()

	content := "¿les gustan los pokemon?"
	lang, reliable := detector.DetectLanguageOf(content)
	log.Println(lang.String(), reliable)

	log.Println(" --- ")

	confidences := detector.ComputeLanguageConfidenceValues(content)
	for _, langConf := range confidences {
		log.Println(langConf.Language().String(), langConf.Value())
	}
}

The following output is produced:

2022/04/20 00:22:43 Finnish true
2022/04/20 00:22:43  --- 
2022/04/20 00:22:43 Finnish 1
2022/04/20 00:22:43 English 0.9883978684270469
2022/04/20 00:22:43 Indonesian 0.978563900119626
2022/04/20 00:22:43 Spanish 0.9747851212151981
2022/04/20 00:22:43 Croatian 0.9724182360849759
2022/04/20 00:22:43 Lithuanian 0.9647225277871057
2022/04/20 00:22:43 Estonian 0.9641581778214242
2022/04/20 00:22:43 Esperanto 0.9606587809451471
2022/04/20 00:22:43 Polish 0.9594230676987932
2022/04/20 00:22:43 Slovene 0.9546050214213473
2022/04/20 00:22:43 Malay 0.9541465232681227
2022/04/20 00:22:43 Albanian 0.9524198444722406
2022/04/20 00:22:43 Italian 0.9486618781887298
2022/04/20 00:22:43 Catalan 0.946963416607054
2022/04/20 00:22:43 Danish 0.9403916449998727
2022/04/20 00:22:43 Bosnian 0.9269675882527444
2022/04/20 00:22:43 Portuguese 0.9261989417434195
2022/04/20 00:22:43 German 0.919921338933763
2022/04/20 00:22:43 Sotho 0.9152876229202939
2022/04/20 00:22:43 Dutch 0.9145928120132025
2022/04/20 00:22:43 French 0.9140644855054184
2022/04/20 00:22:43 Slovak 0.9125324543349711
2022/04/20 00:22:43 Latvian 0.9119548274103094
2022/04/20 00:22:43 Tswana 0.9030296447404719
2022/04/20 00:22:43 Romanian 0.8980252449808623
2022/04/20 00:22:43 Nynorsk 0.8962667914904449
2022/04/20 00:22:43 Tagalog 0.8961041054613276
2022/04/20 00:22:43 Swedish 0.8861739698250194
2022/04/20 00:22:43 Hungarian 0.8860583424196719
2022/04/20 00:22:43 Bokmal 0.8860501842325473
2022/04/20 00:22:43 Swahili 0.8855438630695021
2022/04/20 00:22:43 Czech 0.877987508198549
2022/04/20 00:22:43 Welsh 0.8706583132077192
2022/04/20 00:22:43 Turkish 0.8635506224236865
2022/04/20 00:22:43 Yoruba 0.8618678522282041
2022/04/20 00:22:43 Basque 0.8587542505212317
2022/04/20 00:22:43 Afrikaans 0.8435800177987139
2022/04/20 00:22:43 Maori 0.8429171795365868
2022/04/20 00:22:43 Ganda 0.8407646218672701
2022/04/20 00:22:43 Icelandic 0.8248853640378799
2022/04/20 00:22:43 Tsonga 0.8245248538291974
2022/04/20 00:22:43 Irish 0.817982923494266
2022/04/20 00:22:43 Zulu 0.8175325635441859
2022/04/20 00:22:43 Shona 0.8008811823165958
2022/04/20 00:22:43 Xhosa 0.7829601259301775
2022/04/20 00:22:43 Vietnamese 0.774240344355879
2022/04/20 00:22:43 Azerbaijani 0.7541427903961347
2022/04/20 00:22:43 Somali 0.7538078988192347

I'm not sure why it ranked Spanish 4th. Is there a good method to get around this? Unfortunately, given my use case, I need to detect from a wide range of languages like this.

Overall this library is awesome. I'm using the latest stable release. Thank you for this!

pemistahl (Maintainer) · 2022-04-20T18:54:34Z

Hello Aiden, thanks for trying my library and for your question.

Well, it seems that for this specific sentence, the sum of the n-gram probabilities for Finnish is greater than the one for Spanish. This is not a bug; it is just mathematics. The word "pokemon" is certainly crucial here. It is a proper noun, so it's neither Finnish nor Spanish. At best, it's Japanese. It contains n-grams that are not characteristic of Spanish, so it confuses the algorithm, which returns Finnish. If you remove this word from your sentence, the detector returns Spanish as the most likely language.
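
To make the arithmetic concrete, here is a minimal toy sketch of that kind of scoring. It is not lingua-go's implementation: the `model` type, the `toyModels` table, the floor for unseen trigrams, and every number in it are invented for illustration, and summing logarithms of trigram probabilities is an assumption about how such a score might be computed.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// model maps a character trigram to a made-up probability (hypothetical data).
type model map[string]float64

// toyModels holds two invented models; the values are NOT taken from lingua-go.
var toyModels = map[string]model{
	"Spanish": {"les": 0.02, "gus": 0.01, "ust": 0.01, "sta": 0.02,
		"tan": 0.02, "los": 0.03, "emo": 0.003, "mon": 0.005},
	"Finnish": {"les": 0.005, "gus": 0.001, "ust": 0.01, "sta": 0.03,
		"tan": 0.02, "los": 0.003, "pok": 0.004, "oke": 0.004,
		"kem": 0.003, "emo": 0.002, "mon": 0.01},
}

// trigrams returns the overlapping three-character sequences of each word.
func trigrams(text string) []string {
	var out []string
	for _, word := range strings.Fields(strings.ToLower(text)) {
		r := []rune(word)
		for i := 0; i+3 <= len(r); i++ {
			out = append(out, string(r[i:i+3]))
		}
	}
	return out
}

// score sums the log-probabilities of all trigrams in text.
// Trigrams missing from the model get a small floor probability (an assumption).
func score(m model, text string) float64 {
	const floor = 1e-5
	total := 0.0
	for _, g := range trigrams(text) {
		p, ok := m[g]
		if !ok {
			p = floor
		}
		total += math.Log(p)
	}
	return total
}

// best returns the name of the model with the highest score for text.
func best(models map[string]model, text string) string {
	bestName, bestScore := "", math.Inf(-1)
	for name, m := range models {
		if s := score(m, text); s > bestScore {
			bestName, bestScore = name, s
		}
	}
	return bestName
}

func main() {
	for _, text := range []string{"les gustan los", "les gustan los pokemon"} {
		fmt.Printf("%q -> %s\n", text, best(toyModels, text))
	}
}
```

With these made-up numbers, the short phrase ranks Spanish first, and appending "pokemon" flips the result to Finnish: three of its five trigrams are missing from the toy Spanish model and fall to the floor, which outweighs the lead Spanish had built up on the first three words.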

1 reply

aidenwallis Apr 20, 2022
Author

Interesting, thanks for the context! Sorry for posting this in the wrong place!
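
One way to act on the suggestion of removing the word is to filter known proper nouns out of the text before it reaches the detector. The sketch below is only an illustration under assumptions: `stripIgnored` and the ignore list are hypothetical helpers maintained by the caller, not part of lingua-go, and the list would have to be curated for the use case.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripIgnored drops every whitespace-separated word whose letters and digits,
// lowercased and with surrounding punctuation trimmed, appear in ignore.
// Words that are kept retain their original punctuation.
func stripIgnored(text string, ignore map[string]bool) string {
	var kept []string
	for _, word := range strings.Fields(text) {
		key := strings.ToLower(strings.TrimFunc(word, func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsDigit(r)
		}))
		if ignore[key] {
			continue // skip words the caller has flagged as proper nouns
		}
		kept = append(kept, word)
	}
	return strings.Join(kept, " ")
}

func main() {
	// Hypothetical, caller-maintained list of words to hide from the detector.
	ignore := map[string]bool{"pokemon": true}
	fmt.Println(stripIgnored("¿les gustan los pokemon?", ignore))
}
```

The cleaned string can then be passed to `detector.DetectLanguageOf` in place of the raw content from the original post. The trade-off is that the list needs upkeep, and a word that is a proper noun in one context may be ordinary vocabulary in another.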
