Spanish Lemmatizer doesn't handle vosotros (2pl) #11607
-
How to reproduce the behaviour

While working with the Matcher, I noticed that the rule-based lemmatizer is overall pretty good, but surprisingly it doesn't seem to be able to handle the vosotros form at all. Here are some examples:

```python
import spacy

nlp = spacy.load("es_core_news_lg")

examples = [
    "Vosotros estabais decidiendo el menú de la boda.",
    "Vais a comer pronto.",
    "Vosotros habíais estado bromeando con las chicas.",
    "Vosotros habíais estado bebiendo bastante antes de conducir.",
    "Vosotros estáis mirando la puesta de sol.",
    "Habéis estado chateando por skype una hora.",
]

for text in examples:
    doc = nlp(text)
    print(f"Text: {doc}")
    print("{: <10} {: <10} {: <10} {: <20}".format("TOKEN", "LEMMA", "POS", "MORPH"))
    for token in doc:
        print("{: <10} {: <10} {: <10} {: <20}".format(token.text, token.lemma_, token.pos_, str(token.morph)))
    print('-' * 40)
```
It's interesting that in the compound tenses (e.g. "Vosotros habíais estado bebiendo"), it recognizes habíais as a verb in the right tense, but it lemmatizes it as habíais instead of haber. We see the same thing with estáis (which should lemmatize as estar). Is this an issue with the rule-based lemmatizer? If so, is there anything I can do to help? I took a look at the code, and it appears that vosotros is in the rules list, so I'm wondering why it's not working.

If there isn't an easy fix on that front, is it possible to provide a partial lookup table where we still rely on the rule-based lemmatizer (which is pretty good most of the time!) and fall back only in the cases that we know are impossible, such as any time a verb's lemma doesn't end in -ar, -ir, or -er? A rough sketch of what I have in mind is below. Any suggestions are welcome! Thank you!
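As an illustrative sketch only (the component name `lemma_fallback` and the contents of `FALLBACK_LEMMAS` are made up, not anything built into spaCy), a custom pipeline component could flag verb lemmas that don't end in -ar, -er, or -ir and substitute an entry from a small hand-made lookup table:

```python
import spacy
from spacy.language import Language

# Hypothetical hand-made lookup for forms the rule-based lemmatizer misses.
FALLBACK_LEMMAS = {
    "habíais": "haber",
    "habéis": "haber",
    "estáis": "estar",
    "estabais": "estar",
    "vais": "ir",
}

@Language.component("lemma_fallback")
def lemma_fallback(doc):
    for token in doc:
        # A Spanish verb lemma should end in -ar, -er, or -ir; anything else
        # looks suspicious, so fall back to the lookup table if we have an entry.
        if token.pos_ in ("VERB", "AUX") and not token.lemma_.endswith(("ar", "er", "ir")):
            token.lemma_ = FALLBACK_LEMMAS.get(token.lower_, token.lemma_)
    return doc

nlp = spacy.load("es_core_news_lg")
nlp.add_pipe("lemma_fallback", after="lemmatizer")

doc = nlp("Vosotros habíais estado bebiendo bastante antes de conducir.")
print([(t.text, t.lemma_) for t in doc])
```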
-
I think it's an issue with the `morphologizer` rather than the rules in the lemmatizer. You can see that the MORPH tags are incorrect, and the lemmatizer uses the POS+MORPH tags to pick which rules to apply. (Also, there is not a single occurrence of "vosotros" in the training data.) In all the cases above it looks like the verbs are tagged as `Person=1` or `Person=3`, so the lemmatizer isn't applying the intended rules.

You can double-check the rules here:
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/es_lemma_rules.json

Because there are very few 2nd person pronouns or verbs in the training data, the morphologizer does not learn how to tag them well. The general solution would be to add more training data with 2nd person forms so the morphologizer performs better. As more of a workaround for the current behavior, you can definitely use multiple components in your pipeline to preprocess the morphological tags or postprocess the lemmas to handle cases like this, but there's no built-in component to do exactly what you're asking. I could imagine a custom component that checks for incorrect-looking verb lemmas and falls back to a lookups table in these cases?
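As a hedged sketch of the "preprocess the morphological tags" idea (the component name and the list of endings below are illustrative assumptions, not part of spaCy, and whether this fixes the lemma depends on the rules that end up matching), one could overwrite the morphological features for likely vosotros verb forms before the lemmatizer runs:

```python
from spacy.language import Language

# Hypothetical, non-exhaustive list of 2nd person plural verb endings.
VOSOTROS_ENDINGS = ("áis", "éis", "ís", "abais", "íais", "asteis", "isteis")

@Language.component("fix_vosotros_morph")
def fix_vosotros_morph(doc):
    for token in doc:
        if token.pos_ in ("VERB", "AUX") and token.lower_.endswith(VOSOTROS_ENDINGS):
            # Merge corrected Person/Number into the existing features so the
            # rule-based lemmatizer sees 2nd person plural morphology.
            features = token.morph.to_dict()
            features["Person"] = "2"
            features["Number"] = "Plur"
            token.set_morph(features)
    return doc

# Insert it between the morphologizer and the lemmatizer:
# nlp.add_pipe("fix_vosotros_morph", before="lemmatizer")
```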