ES-CT: linguistics annotations #639

Open
8 tasks
matyaskopp opened this issue Apr 26, 2023 · 3 comments

@matyaskopp
Collaborator

@rjzevallos, I compared the TEI and TEI.ana versions of one file. I don't know how complicated these issues are to fix, but I would be delighted if they were fixed or at least documented, because these bugs can decrease corpus usability.

Syntactic words and join collision

  • syntactic words + join right

TEI version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8" xml:lang="ca"><!-- 
... --> Hi havia, com vostè ha recordat, el segon tripartit. <!-- ... --></seg>

TEI.ana version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8" xml:lang="ca">
<!-- ... -->
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3" xml:lang="ca">
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.1" msd="UPosTag=PRON|PronType=Prs|Person=3" lemma="hi">Hi</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.2" msd="UPosTag=VERB|Mood=Ind|Tense=Imp|Person=3|Number=Sing" lemma="heure" join="right">havia</w>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.3" msd="UPosTag=PUNCT|PunctType=Comm">,</pc>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.4" msd="UPosTag=SCONJ" lemma="com">com</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.5" msd="UPosTag=PRON|PronType=Prs|Person=2|Number=Sing|Polite=Form" lemma="vostè">vostè</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.6" msd="UPosTag=AUX|Mood=Ind|Tense=Pres|Person=3|Number=Sing" lemma="haver">ha</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.7" msd="UPosTag=VERB|Mood=Par|Number=Sing|Gender=Masc" lemma="recordar" join="right">recordat</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.10">
           recordatel
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.8" msd="UPosTag=PUNCT|PunctType=Comm" norm="," lemma=","/>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.9" msd="UPosTag=DET|PronType=Art|Gender=Masc|Number=Sing" norm="el" lemma="el"/>
        </w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.11" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="segon">segon</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.12" msd="UPosTag=VERB|Mood=Par|Number=Sing|Gender=Masc" lemma="tripartir" join="right">tripartit</w>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.13" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
        <linkGrp type="UD-SYN" targFunc="head argument">
          <!-- ... -->
        </linkGrp>
    </s>
<!-- ... -->
</seg>

TEI version contains sentence:

Hi havia, com vostè ha recordat, el segon tripartit.

but TEI.ana version contains a different sentence:

Hi havia, com vostè ha recordatrecordatel segon tripartit.
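
A minimal sketch of how such mismatches could be checked automatically: it rebuilds the surface text of an annotated sentence so that it can be compared with the plain TEI text. It assumes that join="right" means "no space before the next token" and that a `<w>` wrapping other `<w>` elements carries the surface form while its children are the syntactic words. The function names and the shortened sample are hypothetical, not part of any ParlaMint tooling.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Return the element name without any namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def surface_tokens(elem):
    """Yield (text, join_right) for each surface token under elem, in order."""
    for child in elem:
        name = local(child.tag)
        if name in ("w", "pc"):
            # Do not descend into <w>: nested <w> children are syntactic
            # words without text; only the wrapper holds the surface form.
            text = (child.text or "").strip()
            if text:
                yield text, child.get("join") == "right"
        elif name == "name":
            # Named entities only group tokens, so look inside them.
            yield from surface_tokens(child)
        # Other elements (e.g. linkGrp) carry no surface text.


def sentence_text(s_elem):
    """Rebuild the sentence string, honouring join="right"."""
    parts = []
    for text, join_right in surface_tokens(s_elem):
        parts.append(text)
        if not join_right:
            parts.append(" ")
    return "".join(parts).strip()


# Shortened, hypothetical version of the sentence quoted above.
SAMPLE = """<s>
  <w>Hi</w><w join="right">havia</w><pc>,</pc>
  <w join="right">recordat</w>
  <w>recordatel<w norm=","/><w norm="el"/></w>
  <w>segon</w><w join="right">tripartit</w><pc>.</pc>
</s>"""

print(sentence_text(ET.fromstring(SAMPLE)))
```

Comparing this string with the whitespace-normalised text of the corresponding `<seg>` in the TEI version would flag the sentence above.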

Missing UD features in named entities, wrong UPosTag

  • UD features in NE
  • UPosTag in NE
<name type="MISC">
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.28" msd="UPosTag=PROPN" lemma="reglament">Reglament</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.29" msd="UPosTag=PROPN" lemma="de">de</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.30" msd="UPosTag=PROPN" lemma="el">el</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.31" msd="UPosTag=PROPN" lemma="parlament">Parlament</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.32" msd="UPosTag=PROPN" lemma="de">de</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.33" msd="UPosTag=PROPN" lemma="catalunya">Catalunya</w>
</name>
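
A rough heuristic for finding more cases like this one, as a sketch only: it lists lowercase tokens inside `<name>` whose msd is exactly "UPosTag=PROPN", on the assumption that function words such as "de" or "el" should not be bare proper nouns. The function name and the sample are hypothetical, and nested `<name>` elements would be reported more than once.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local(tag):
    """Return the element name without any namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def suspicious_ne_tokens(root):
    """Return (xml:id, form) for lowercase bare-PROPN tokens inside <name>."""
    hits = []
    for name in root.iter():
        if local(name.tag) != "name":
            continue
        for w in name.iter():
            if local(w.tag) != "w":
                continue
            form = (w.text or "").strip()
            # Lowercase form + PROPN without any features is the suspicious case.
            if form.islower() and w.get("msd") == "UPosTag=PROPN":
                hits.append((w.get(XML_ID), form))
    return hits


# Hypothetical, shortened sample with made-up ids.
SAMPLE = """<s>
  <name type="MISC">
    <w xml:id="n.1" msd="UPosTag=PROPN">Reglament</w>
    <w xml:id="n.2" msd="UPosTag=PROPN">de</w>
    <w xml:id="n.3" msd="UPosTag=PROPN">el</w>
    <w xml:id="n.4" msd="UPosTag=PROPN">Parlament</w>
  </name>
</s>"""

print(suspicious_ne_tokens(ET.fromstring(SAMPLE)))
```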

No syntactic words in named entities + missing join right

  • syntactic words and NEs
  • join right in NEs

TEI version:

               <seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3" xml:lang="ca"><!-- ... --> Reglament del Parlament de Catalunya.</seg>

TEI.ana version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3" xml:lang="ca">
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1" xml:lang="ca">
        <name type="MISC">
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.28" msd="UPosTag=PROPN" lemma="reglament">Reglament</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.29" msd="UPosTag=PROPN" lemma="de">de</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.30" msd="UPosTag=PROPN" lemma="el">el</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.31" msd="UPosTag=PROPN" lemma="parlament">Parlament</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.32" msd="UPosTag=PROPN" lemma="de">de</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.33" msd="UPosTag=PROPN" lemma="catalunya">Catalunya</w>
        </name>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.34" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
    </s>
</seg>

TEI version contains:

Reglament del Parlament de Catalunya.

but TEI.ana version contains:

Reglament de el Parlament de Catalunya .

Missing join right in articles

  • articles + join

TEI:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1" xml:lang="ca">D’acord amb l’article 146

TEI.ana

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1" xml:lang="ca">
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1" xml:lang="ca">
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.1" msd="UPosTag=ADP|AdpType=Prep" lemma="de">D'</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.2" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="acord">acord</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.3" msd="UPosTag=ADP|AdpType=Prep" lemma="amb">amb</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.4" msd="UPosTag=DET|PronType=Art|Number=Sing" lemma="el">l'</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.5" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="article">article</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.6" msd="UPosTag=NUM" lemma="146">146</w>

TEI version contains:

D’acord amb l’article 146

but TEI.ana version contains:

D’ acord amb l’ article 146
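
These cases could be listed with a small check like the following sketch, which reports every `<w>` whose form ends in an apostrophe but has no join="right". It assumes that elided forms such as D' and l' always attach to the next token; the function name and the sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
APOSTROPHES = ("'", "\u2019")  # straight and typographic apostrophe


def missing_join_after_apostrophe(root):
    """Return xml:id of <w> tokens ending in an apostrophe without join="right"."""
    hits = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "w":
            continue
        form = (elem.text or "").strip()
        if form.endswith(APOSTROPHES) and elem.get("join") != "right":
            hits.append(elem.get(XML_ID))
    return hits


# Hypothetical sample: the last token shows what a correct annotation would look like.
SAMPLE = """<s>
  <w xml:id="a.1">D'</w><w xml:id="a.2">acord</w>
  <w xml:id="a.3">amb</w>
  <w xml:id="a.4" join="right">l'</w><w xml:id="a.5">article</w>
</s>"""

print(missing_join_after_apostrophe(ET.fromstring(SAMPLE)))
```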

Syntactic words at the beginning of sentence ???

  • Syntactic words at the beginning of sentence

TEI:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25" xml:lang="ca"><!-- 
... --> Dels tribunals ordinaris de la justícia catalana. <!-- ... --></seg>

TEI.ana:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25" xml:lang="ca">
<!-- ... --> 
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5" xml:lang="ca">
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.1" msd="UPosTag=ADP|AdpType=Prep" lemma="de">De</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.2" msd="UPosTag=DET|PronType=Art|Gender=Masc|Number=Plur" lemma="el">els</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.3" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" lemma="tribunal">tribunals</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.4" msd="UPosTag=ADJ|Gender=Masc|Number=Plur" lemma="ordinari">ordinaris</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.5" msd="UPosTag=ADP|AdpType=Prep" lemma="de">de</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.6" msd="UPosTag=DET|PronType=Art|Gender=Fem|Number=Sing" lemma="el">la</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.7" msd="UPosTag=NOUN|Number=Sing" lemma="justícia">justícia</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.8" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="català" join="right">catalana</w>
       <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.9" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
<!-- ... -->
    </s>
<!-- ... -->
</seg>

TEI version contains:

Dels tribunals ordinaris de la justícia catalana.

but TEI.ana version contains:

De els tribunals ordinaris de la justícia catalana.

Misplaced join right

  • misplaced join right

TEI:

... fixin-s’hi en ...

TEI.ana

    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.67">
    fixin-s
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.65" msd="UPosTag=VERB|Mood=Sub|Tense=Pres|Person=3|Number=Plur" norm="fixin" lemma="fixar"/>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.66" msd="UPosTag=PRON" norm="-s" lemma="es"/>
    </w>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.68" msd="UPosTag=PRON|PronType=Prs|Person=3" lemma="hi" join="right">'hi</w>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.69" msd="UPosTag=ADP|AdpType=Prep" lemma="en">en</w>

TEI: fixin-s’hi en vs TEI.ana: fixin-s ’hien

@matyaskopp matyaskopp added the bug Something isn't working label Apr 26, 2023
@matyaskopp matyaskopp added this to the ParlaMint 3.1 release milestone Apr 26, 2023
@nuriabel
Collaborator

nuriabel commented May 3, 2023 via email

@maartenpt

Contractions seem not to have been dealt with properly in general, not only at the beginning of a sentence. Take the word "del", which is never split into de+el and has a wide range of upos tags:

PARLAMINT-31-PARLAMINT-ES-CT> Matches = [form="del"];
PARLAMINT-31-PARLAMINT-ES-CT> group Matches match upos;
#---------------------------------------------------------------------
(all)                         NOUN                               95942
                              ADJ                                12833
                              PUNCT                              10153
                              ADP                                 9990
                              VERB                                8936
                              ADV                                 8071
                              PROPN                               8013
                              NUM                                 4084
                              CCONJ                               3954
                              DET                                 2099
                              AUX                                 1382
                              PRON                                 915
                              SCONJ                                308
                              INTJ                                   1
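
For anyone without access to the CQP index, a comparable tally can be sketched directly on a TEI.ana file. This is not the query above, only an approximation under the assumption that the UPOS tag is stored as the UPosTag feature of the msd attribute, as in the snippets in this issue. The function name and the sample are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def upos_counts(root, form):
    """Count UPosTag values of all <w> tokens whose lowercased text equals form."""
    counts = Counter()
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "w":
            continue
        if (elem.text or "").strip().lower() != form:
            continue
        # msd looks like "UPosTag=ADP|AdpType=Prep"
        for feat in elem.get("msd", "").split("|"):
            key, _, value = feat.partition("=")
            if key == "UPosTag":
                counts[value] += 1
    return counts


# Hypothetical sample; real input would be ET.parse(path).getroot().
SAMPLE = """<s>
  <w msd="UPosTag=ADP|AdpType=Prep">del</w>
  <w msd="UPosTag=NOUN">del</w>
  <w msd="UPosTag=ADP">Del</w>
  <w msd="UPosTag=DET">el</w>
</s>"""

print(upos_counts(ET.fromstring(SAMPLE), "del").most_common())
```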

@matyaskopp
Collaborator Author

matyaskopp commented Sep 29, 2023

  • 100_per_cent in annotated data
<seg xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2" xml:lang="ca">Les mesures efectives són: política de garantia de rendes, amb vies d'incidència directa en els infants; més pressupost –o, més aviat, més inversió– en polítiques d'infància, i indicadors i avaluació per garantir que les mesures i els recursos que s'hi destinin siguin efectius. Perquè també ja fa dos anys els deia, amb dades de l'Idescat, que les polítiques catalanes en aquest tema són molt ineficients. Les transferències socials només estaven aconseguint reduir un 15 per cent el risc de pobresa en la població menor de divuit anys a Catalunya. Amb dades del 2018. Una xifra que és ridícula, si la comparem amb què passa amb altres sectors de la població. Entre les persones de més de seixanta-cinc anys les transferències socials fan reduir un 82 per cent el risc de pobresa. Només un 15 per cent, en el cas dels infants; un 82 per cent en el cas de les persones grans.</seg>

annotated version:

<seg xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2" xml:lang="ca">
<!-- ... -->
                  <s xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7" xml:lang="ca">
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.3" msd="UPosTag=NUM" lemma="15/100" join="right">15_per_cent</w>
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.14" msd="UPosTag=NUM" lemma="82/100">82_per_cent</w>
<!-- ... -->
                  </s>
</seg>

I understand that you probably used it to fix wrong tokenization, but you forgot to remove the underscores (_).


Same issue with a different unit:

image
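
A small sketch for locating the remaining cases, whatever the unit: it lists every `<w>` or `<pc>` whose text contains an underscore. The function name and the sample ids are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def tokens_with_underscore(root):
    """Return (xml:id, form) for every <w>/<pc> whose text contains "_"."""
    hits = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] not in ("w", "pc"):
            continue
        form = (elem.text or "").strip()
        if "_" in form:
            hits.append((elem.get(XML_ID), form))
    return hits


# Hypothetical sample with made-up ids.
SAMPLE = """<s>
  <w xml:id="p.1">un</w>
  <w xml:id="p.2" join="right">15_per_cent</w>
  <w xml:id="p.3">el</w>
</s>"""

print(tokens_with_underscore(ET.fromstring(SAMPLE)))
```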
