Spanish Lemmatizer doesn't handle vosotros (2pl) #11607
-
How to reproduce the behaviour

While working with the Matcher, I noticed that the rule-based lemmatizer is overall pretty good, but surprisingly it doesn't seem to be able to handle the vosotros form at all. Here are some examples:

```python
import spacy

nlp = spacy.load("es_core_news_lg")

examples = [
    "Vosotros estabais decidiendo el menú de la boda.",
    "Vais a comer pronto.",
    "Vosotros habíais estado bromeando con las chicas.",
    "Vosotros habíais estado bebiendo bastante antes de conducir.",
    "Vosotros estáis mirando la puesta de sol.",
    "Habéis estado chateando por skype una hora.",
]

for text in examples:
    doc = nlp(text)
    print(f"Text: {doc}")
    print("{: <10} {: <10} {: <10} {: <20}".format("TOKEN", "LEMMA", "POS", "MORPH"))
    for token in doc:
        print("{: <10} {: <10} {: <10} {: <20}".format(token.text, token.lemma_, token.pos_, str(token.morph)))
    print('-' * 40)
```
It's interesting that in the compound tenses (e.g. "Vosotros habíais estado bebiendo"), it recognizes habíais as a verb in the right tense, but it lemmatizes it as habíais instead of haber. We see the same thing with estáis (which should lemmatize as estar). Is this an issue with the rule-based lemmatizer? If so, is there anything I can do to help? I took a look at the code, and it appears that vosotros is in the rules list, so I'm wondering why it's not working.

If there isn't an easy fix on that front, is it possible to provide a partial lookup table where we still rely on the rule-based lemmatizer (which is pretty good most of the time!) and fall back only in the cases that we know are impossible, such as any time a verb's lemma doesn't end in -ar, -ir, or -er? A rough sketch of what I have in mind is below. Any suggestions are welcome! Thank you!
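As an illustrative sketch only (the component name `lemma_fallback` and the contents of `FALLBACK_LEMMAS` are made up, not anything built into spaCy), a custom pipeline component could flag verb lemmas that don't end in -ar, -er, or -ir and substitute an entry from a small hand-made lookup table:

```python
import spacy
from spacy.language import Language

# Hypothetical hand-made lookup for forms the rule-based lemmatizer misses.
FALLBACK_LEMMAS = {
    "habíais": "haber",
    "habéis": "haber",
    "estáis": "estar",
    "estabais": "estar",
    "vais": "ir",
}

@Language.component("lemma_fallback")
def lemma_fallback(doc):
    for token in doc:
        # A Spanish verb lemma should end in -ar, -er, or -ir; anything else
        # looks suspicious, so fall back to the lookup table if we have an entry.
        if token.pos_ in ("VERB", "AUX") and not token.lemma_.endswith(("ar", "er", "ir")):
            token.lemma_ = FALLBACK_LEMMAS.get(token.lower_, token.lemma_)
    return doc

nlp = spacy.load("es_core_news_lg")
nlp.add_pipe("lemma_fallback", after="lemmatizer")

doc = nlp("Vosotros habíais estado bebiendo bastante antes de conducir.")
print([(t.text, t.lemma_) for t in doc])
```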
-
I think it's an issue with the `morphologizer` rather than the rules in the lemmatizer. You can see that the MORPH tags are incorrect, and the lemmatizer uses the POS+MORPH tags to pick which rules to apply. (Also, there is not a single occurrence of "vosotros" in the training data.) In all the cases above it looks like the verbs are tagged as `Person=1` or `Person=3`, so the lemmatizer isn't applying the intended rules.

You can double-check the rules here:
https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/es_lemma_rules.json

Because there are very few 2nd person pronouns or verbs in the training data, the morphologizer does not learn how to tag them well. The general solution would be to add more training data with 2nd person forms so the morphologizer performs better. As more of a workaround for the current behavior, you can definitely use multiple components in your pipeline to preprocess the morphological tags or postprocess the lemmas to handle cases like this, but there's no built-in component to do exactly what you're asking. I could imagine a custom component that checks for incorrect-looking verb lemmas and falls back to a lookups table in these cases?
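As a hedged sketch of the "preprocess the morphological tags" idea (the component name and the list of endings below are illustrative assumptions, not part of spaCy, and whether this fixes the lemma depends on the rules that end up matching), one could overwrite the morphological features for likely vosotros verb forms before the lemmatizer runs:

```python
from spacy.language import Language

# Hypothetical, non-exhaustive list of 2nd person plural verb endings.
VOSOTROS_ENDINGS = ("áis", "éis", "ís", "abais", "íais", "asteis", "isteis")

@Language.component("fix_vosotros_morph")
def fix_vosotros_morph(doc):
    for token in doc:
        if token.pos_ in ("VERB", "AUX") and token.lower_.endswith(VOSOTROS_ENDINGS):
            # Merge corrected Person/Number into the existing features so the
            # rule-based lemmatizer sees 2nd person plural morphology.
            features = token.morph.to_dict()
            features["Person"] = "2"
            features["Number"] = "Plur"
            token.set_morph(features)
    return doc

# Insert it between the morphologizer and the lemmatizer:
# nlp.add_pipe("fix_vosotros_morph", before="lemmatizer")
```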