This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

The Neapolitan model is unusable #1356

Open
bfontaine opened this issue Jan 3, 2024 · 0 comments


Hello,
The model for Neapolitan (nap) is unusable because of its poor quality:

>>> ft.get_nearest_neighbors("cuccuziello")
[(0.9683643579483032, 'Masiello'), (0.9683618545532227, 'soldatiello'), (0.9682843685150146, 'Mezzaniello'), (0.9651128053665161, 'perettiello'), (0.963299572467804, 'maretiello'), (0.9630503058433533, 'nnammoratiello'), (0.9629217386245728, 'Fermariello'), (0.9614925384521484, 'poveriello'), (0.9613924622535706, 'Manniello'), (0.9589092135429382, 'ciancianiello')]

The output above lists the nearest neighbors of "cuccuziello", the word for "zucchini". The model returns "Masiello" (a family name), "soldatiello" (diminutive of "soldat"), "Mezzaniello" (a type of pasta), "perettiello" (a type of wine container), "maretiello" (diminutive of "husband"), etc.
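One possible reading of this output: fastText builds word vectors from character n-grams, so words that share an ending also share part of their representation. The sketch below is plain Python, not the fastText API. `char_ngrams` is a hypothetical helper, and the n-gram length of 5 is an assumption for illustration (the setting used for the nap model is not stated here).

```python
def char_ngrams(word, n=5):
    """Return the set of character n-grams of `word`, wrapped in '<' and '>'
    boundary markers (the convention fastText uses for subwords)."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + n] for i in range(len(wrapped) - n + 1)}


# Compare the query with one of its returned neighbors.
query_grams = char_ngrams("cuccuziello")
neighbor_grams = char_ngrams("soldatiello")

# The only n-grams in common come from the shared "-iello" ending.
shared = query_grams & neighbor_grams
print(sorted(shared))  # ['ello>', 'iello']
```

If the ending accounts for all of the overlap, as it does here, the similarity score likely reflects spelling more than meaning.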

Let’s try with mare (sea):

>>> ft.get_nearest_neighbors("mare")
[(0.6819297671318054, 'maree'), (0.6802213788032532, 'sommare'), (0.67812180519104, 'Altomare'), (0.6762729287147522, 'mmare'), (0.6754312515258789, 'sciummare'), (0.6556524038314819, 'Oltremare'), (0.6542813181877136, 'amare'), (0.6521005630493164, 'Croismare'), (0.6465907692909241, 'lungomare'), (0.6444516181945801, 'Zimmare')]

Here it’s marginally better: 4 of the 10 words are related to the sea, probably because "mare" is the same word in Italian and all of those words come from Italian.

Let’s try a well-known word, "guaglione" (young man, adolescent):

>>> ft.get_nearest_neighbors("guaglione")
[(0.9444118738174438, 'gguaglione'), (0.9239395260810852, 'uaglione'), (0.922201931476593, 'Quaglione'), (0.9067193269729614, 'Guaglione'), (0.8721657991409302, 'Scaglione'), (0.8564983010292053, 'Baglione'), (0.8542811870574951, 'Faraglione'), (0.8541175127029419, 'muraglione'), (0.8494646549224854, 'Zampaglione'), (0.8474137783050537, 'Maglione')]

"gguaglione" (feminine plural), "uaglione" (variant) and "Guaglione" (with a capital letter) are all forms of "guaglione", but the other seven words have nothing to do with it.
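A rough way to quantify this pattern is to measure how many trailing characters each neighbor shares with the query. This is only a sketch: `common_suffix_len` is a hypothetical helper, not part of fastText, and the neighbor list is copied from the output above.

```python
def common_suffix_len(a, b):
    """Length of the longest suffix shared by `a` and `b`, ignoring case."""
    a, b = a.lower(), b.lower()
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# (score, word) pairs as returned by ft.get_nearest_neighbors("guaglione").
neighbors = [
    (0.9444118738174438, "gguaglione"), (0.9239395260810852, "uaglione"),
    (0.922201931476593, "Quaglione"), (0.9067193269729614, "Guaglione"),
    (0.8721657991409302, "Scaglione"), (0.8564983010292053, "Baglione"),
    (0.8542811870574951, "Faraglione"), (0.8541175127029419, "muraglione"),
    (0.8494646549224854, "Zampaglione"), (0.8474137783050537, "Maglione"),
]

overlaps = [(word, common_suffix_len("guaglione", word)) for _, word in neighbors]
for word, length in overlaps:
    print(f"{word}: {length}")
```

Every neighbor shares at least the seven-character ending "aglione" with the query, which suggests the ranking follows spelling rather than meaning.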

Is there anything one can do to improve the accuracy of the model, or is the poor quality inherent to the small size of the corpus?
