ES-CT: linguistics annotations #639
Dear Matyás,
Thanks a lot for your careful analysis. Yes, we are going to correct most
of the errors you found, except for the Named Entities analysis. We will keep
you updated.
Best regards
N.
On Wed, 26 Apr 2023 at 10:20, Matyáš Kopp ***@***.***> wrote:
… Assigned #639 to @nuriabel <https://github.com/nuriabel>.
Contractions do not seem to have been handled properly in general, not only at the beginning of a word. Take the word "del", which is never split into de+el and has a wide range of UPOS tags:
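A tally like this can be reproduced with a short script. The following is only a minimal sketch: the helper name `upos_counts` is hypothetical, the sample tokens are made up for illustration and are not taken from the corpus, and it assumes that `msd` holds `Key=Value` pairs separated by `|`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def upos_counts(root, form):
    """Count the UPosTag values assigned to one (lower-cased) word form."""
    counts = Counter()
    for el in root.iter():
        # Skip comments/processing instructions and anything that is not <w>.
        if not isinstance(el.tag, str) or el.tag.rsplit("}", 1)[-1] != "w":
            continue
        if (el.text or "").strip().lower() != form:
            continue
        # Assumption: msd looks like "UPosTag=ADP|Feature=Value|...".
        feats = dict(
            item.split("=", 1) for item in el.get("msd", "").split("|") if "=" in item
        )
        counts[feats.get("UPosTag", "(none)")] += 1
    return counts


# Illustrative input only; these tokens are NOT real corpus data.
SAMPLE = """<s>
  <w msd="UPosTag=ADP" lemma="del">del</w>
  <w msd="UPosTag=NOUN" lemma="any">any</w>
  <w msd="UPosTag=DET" lemma="del">Del</w>
  <w msd="UPosTag=ADP" lemma="del">del</w>
</s>"""

counts = upos_counts(ET.fromstring(SAMPLE), "del")
for tag, n in counts.most_common():
    print(tag, n)
```

On a real file, `ET.parse(path).getroot()` would replace the inline sample. The function matches elements by local name, so it works whether or not the TEI namespace is declared.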
<seg xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2" xml:lang="ca">Les mesures efectives són: política de garantia de rendes, amb vies d'incidència directa en els infants; més pressupost –o, més aviat, més inversió– en polítiques d'infància, i indicadors i avaluació per garantir que les mesures i els recursos que s'hi destinin siguin efectius. Perquè també ja fa dos anys els deia, amb dades de l'Idescat, que les polítiques catalanes en aquest tema són molt ineficients. Les transferències socials només estaven aconseguint reduir un 15 per cent el risc de pobresa en la població menor de divuit anys a Catalunya. Amb dades del 2018. Una xifra que és ridícula, si la comparem amb què passa amb altres sectors de la població. Entre les persones de més de seixanta-cinc anys les transferències socials fan reduir un 82 per cent el risc de pobresa. Només un 15 per cent, en el cas dels infants; un 82 per cent en el cas de les persones grans.</seg> <seg xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2" xml:lang="ca">
<!-- ... -->
<s xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7" xml:lang="ca">
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.3" msd="UPosTag=NUM" lemma="15/100" join="right">15_per_cent</w>
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.14" msd="UPosTag=NUM" lemma="82/100">82_per_cent</w>
</seg>
I understand that you probably used it to fix wrong tokenization, but you forgot to remove the underscores. Same issue, different unit:
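To find every affected token rather than individual cases, a check along these lines might help. It is a rough sketch, not the project's validation tooling: the helper name `underscore_tokens` is hypothetical, and the inline sample reuses only the two `<w>` elements quoted above.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def underscore_tokens(root):
    """Return (xml:id, text) for every <w> whose surface form contains '_'."""
    hits = []
    for el in root.iter():
        # Skip comments/processing instructions and anything that is not <w>.
        if not isinstance(el.tag, str) or el.tag.rsplit("}", 1)[-1] != "w":
            continue
        if el.text and "_" in el.text:
            hits.append((el.get(XML_ID), el.text))
    return hits


SAMPLE = """<s xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7" xml:lang="ca">
  <w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.3" msd="UPosTag=NUM" lemma="15/100" join="right">15_per_cent</w>
  <w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.14" msd="UPosTag=NUM" lemma="82/100">82_per_cent</w>
</s>"""

hits = underscore_tokens(ET.fromstring(SAMPLE))
for xml_id, text in hits:
    print(xml_id, text)
```

Since the original TEI text reads "15 per cent", each hit is a candidate for restoring the spaces in the token text.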
@rjzevallos, I compared the TEI and TEI.ana versions of one file. I don't know how complicated it is to fix these issues, but I would be delighted if they were fixed or at least documented, because these bugs can decrease corpus usability.
Syntactic words and join collision
TEI version:
TEI.ana version:
The TEI version contains the sentence:
but the TEI.ana version contains a different sentence:
Missing UD features in named entities, wrong UPosTag
No syntactic words in named entities + missing join right
TEI version:
TEI.ana version:
TEI version contains:
but TEI.ana version contains:
Missing join right in articles
TEI:
TEI.ana:
TEI version contains:
but TEI.ana version contains:
Syntactic words at the beginning of sentence ???
TEI:
TEI.ana:
TEI version contains:
but TEI.ana version contains:
Misplaced join right
TEI:
TEI.ana:
TEI: fixin-s’hi en
vs. TEI.ana: fixin-s ’hien
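A mismatch like this one can be caught automatically by rebuilding the text from the TEI.ana tokens and comparing it with the TEI text. The sketch below assumes that `join="right"` means "no space before the following token"; the helper name `detokenize` and the token split used in the samples (`fixin`, `-s`, `’hi`, `en`) are hypothetical and chosen only to reproduce the two strings above.

```python
import xml.etree.ElementTree as ET


def detokenize(sentence):
    """Rebuild surface text from <w>/<pc> tokens, honouring join="right"."""
    parts = []
    for el in sentence.iter():
        if not isinstance(el.tag, str):
            continue  # comments and processing instructions
        if el.tag.rsplit("}", 1)[-1] not in ("w", "pc"):
            continue
        text = (el.text or "").strip()
        if not text:
            continue  # e.g. nested syntactic words that carry no surface text
        parts.append(text)
        # Assumption: join="right" suppresses the space after this token.
        if el.get("join") != "right":
            parts.append(" ")
    return "".join(parts).strip()


# Hypothetical tokenization with the joins where the TEI text expects them.
EXPECTED = """<s><w join="right">fixin</w><w join="right">-s</w><w>’hi</w><w>en</w></s>"""
# Same tokens with the join moved one token too far to the right.
MISPLACED = """<s><w join="right">fixin</w><w>-s</w><w join="right">’hi</w><w>en</w></s>"""

tei_text = "fixin-s’hi en"
for label, xml in (("expected", EXPECTED), ("misplaced", MISPLACED)):
    rebuilt = detokenize(ET.fromstring(xml))
    print(label, repr(rebuilt), "OK" if rebuilt == tei_text else "MISMATCH")
```

Running the same comparison per `<seg>` over a whole file, after whitespace normalization of the TEI text, would list every sentence where the two versions disagree.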