PL: many transcriber comments not annotated #84

Open
ajdapretnar opened this issue May 25, 2021 · 14 comments
Labels: bug (Something isn't working)

Comments

@ajdapretnar
Collaborator

For the Polish data set, the top keywords include dzwonek (bell) and oklaski (applause). These should not be among the top keywords, because they are audio notations, not part of the speech itself.
Dzwonek also often gets tagged as a named entity.

@matyaskopp
Collaborator

I probably don't understand the point of this issue...

PL corpus is correctly encoded:

</linkGrp>
</s>
</seg>
<kinesic type="applause">
<desc>Oklaski</desc>
</kinesic>
<seg xml:id="seg747089">
<s xml:id="seg747089.1">

In this case, kinesic is within the u (speech element) because it happens during the speech. Everything that has been said by a speaker is in the seg element.

So if you want to get the top keywords, you can just look into //u/seg elements. Or if it is exact words, you can look for w element in the annotated version of the corpus (or look for the //w/@lemma attribute for fusional languages).
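A minimal sketch of that lookup, assuming the annotated files use the TEI namespace and that each word is a `w` element with a `lemma` attribute. The sample fragment and the function name `lemma_counts` are made up for illustration and are not taken from the corpus:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace; assumed to be the default namespace of the corpus files.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment shaped like the excerpt above (not real corpus data).
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <u>
    <seg><s><w lemma="dziękować">Dziękuję</w><w lemma="bardzo">bardzo</w></s></seg>
    <kinesic type="applause"><desc>Oklaski</desc></kinesic>
    <seg><s><w lemma="dziękować">Dziękuję</w></s></seg>
  </u>
</body>"""

def lemma_counts(xml_text):
    """Count lemmas of w elements that sit inside u/seg."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # Only words below u/seg are visited; kinesic/desc is a sibling of seg
    # and therefore never contributes to the counts.
    for w in root.iterfind(".//tei:u/tei:seg//tei:w", NS):
        lemma = w.get("lemma")
        if lemma:
            counts[lemma] += 1
    return counts

print(lemma_counts(SAMPLE).most_common())
```

Because the path goes through `seg`, the "Oklaski" description inside `kinesic` does not end up among the keywords.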

Do you have an example of incorrect encoding?

@ajdapretnar
Collaborator Author

ajdapretnar commented May 25, 2021

Perhaps NoSketch is not properly recognizing the tags then?

These are the outstanding words for the COVID period for regular MPs.
[Screenshot: Screen Shot 2021-05-25 at 12 01 29]

Next, the concordances of "dzwonek". I am not sure whether "Dzwonek" can also be a person or place name in these cases, but the tagging looks incorrect (a Polish student confirmed this).
[Screenshot: Screen Shot 2021-05-25 at 12 01 44]

@matyaskopp
Collaborator

OK, so it is not a problem with the data, but a problem with the representation in NoSketch.
I believe that it will be solved with #83. @TomazErjavec, am I right?

@ajdapretnar
Collaborator Author

How about the named entities? The "PER:" and "LOC:" tags for "dzwonek"? Do they make sense in the original? Or might this be an issue?

@matyaskopp
Collaborator

Oh, I had not studied the screenshot carefully. That is really weird; it should not happen!

Every named entity should contain at least one token.

@matyaskopp matyaskopp added the bug Something isn't working label May 25, 2021
@TomazErjavec
Collaborator

OK, so it is not a problem with the data, but a problem with the representation in NoSketch. I believe that it will be solved with #83. @TomazErjavec, am I right?

Not quite: currently, the contents of incidents (represented in vert/noSkE as <note>) are indeed encoded in vert/noSkE as one token. However, these tokens a) are bracketed (e.g. "[Dzwonek]") and b) carry no annotations, i.e. they do not get lemma, pos, etc., nor should they be included in <name>/NER tags. So all the examples of "Dzwonek" that @ajdapretnar has shown above in the concordances are part of the regular text. At the same time, some (well, most) do appear inside incidents, i.e. are correctly encoded.

These are the stats:

$ cat ParlaMint-PL.vert/*.vert | fgrep -c Dzwonek
17904
$ cat ParlaMint-PL.vert/*.vert | fgrep -c '[Dzwonek]'
15603

So, most are ok, but by no means all.
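Note that `fgrep -c` counts matching lines, and the first count also includes the bracketed lines, since "Dzwonek" is a substring of "[Dzwonek]". A rough sketch of a token-level count that keeps the two cases apart, assuming the vertical format has one token per line with tab-separated attributes and structural tags on lines starting with `<`. The function name `count_dzwonek` and the sample lines are made up for illustration:

```python
from collections import Counter

def count_dzwonek(lines, word="Dzwonek"):
    """Count bracketed (incident) and plain (running text) tokens separately."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: lines starting with "<" are structural tags such as
        # <s> or <note>. A literal "<" token would also be skipped here.
        if not line or line.startswith("<"):
            continue
        token = line.split("\t", 1)[0]  # first column is the word form
        if token == f"[{word}]":
            counts["incident"] += 1
        elif token == word:
            counts["text"] += 1
    return counts

# Made-up sample lines, not taken from ParlaMint-PL.vert.
sample = [
    "<s>",
    "Dzwonek\tdzwonek\tNOUN",
    "</s>",
    "<note>",
    "[Dzwonek]",
    "</note>",
    "[Dzwonek]",
]
print(count_dzwonek(sample))
```

On real files, the lines could be fed in from the opened `.vert` files instead of the `sample` list.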

@matyaskopp matyaskopp removed the bug Something isn't working label May 25, 2021
@TomazErjavec TomazErjavec added this to the next milestone Jun 2, 2021
@TomazErjavec
Collaborator

OK, the summary is that PL has 2301 cases (17904 - 15603) of "Dzwonek" as part of the text when, presumably (?), they should be encoded as incidents. This is about 13% of all occurrences of "Dzwonek", so not a negligible amount. Not sure why @matyaskopp removed the bug label, as this presumably is a bug. I guess the issue should remain open (even though its name is not really the best) in the hope that somebody fixes this in the fullness of time.

@matyaskopp
Collaborator

I assumed that a "Dzwonek" is part of the speech (not just a note), but if it should be encoded as an incident, then it is a bug (placing the label back).

I was not able to trace it back to the original source data to see how it is encoded there, because the PL data do not reference the source precisely. (This should be an issue for the next releases: keeping back-references to the source.)

@matyaskopp matyaskopp added the bug Something isn't working label Jun 2, 2021
@TomazErjavec TomazErjavec changed the title PL dataset: audio references should be excluded from analysis PL: audio references should be excluded from analysis May 23, 2022
@TomazErjavec TomazErjavec modified the milestones: next, ParlaMint 3.0 release Nov 12, 2022
@TomazErjavec
Collaborator

@mrudolf, don't forget to address this issue pls. And close or tell us to close when fixed.

@mrudolf
Collaborator

mrudolf commented Jan 25, 2023

Alas, pandemic sessions are apparently badly annotated by the Parliament, with many speakers missing.

We are now proof-reading all the sessions, but this will only be finished in February. I hope it will be possible to update our corpora to the corrected version then.

@TomazErjavec TomazErjavec changed the title PL: audio references should be excluded from analysis PL: many transciber comments not annotated Jun 8, 2023
@TomazErjavec
Collaborator

This issue still persists in the (draft) 3.0: e.g. for the query "(, Dzwonek, )" there are 6,785 hits.
However, to be fair, in all the ones I looked at, "(Dzwonek)" appears in the middle of a sentence, so it is a bit complicated to do the correct annotations.
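To make the token query concrete, here is a small sketch that looks for the three-token run "(", "Dzwonek", ")" inside one tokenised sentence and reports whether it sits mid-sentence. The function name `find_parenthesised` and the example sentence are invented for illustration; this is not how the noSkE query itself is run:

```python
def find_parenthesised(tokens, word="Dzwonek"):
    """Return (index, mid_sentence) for each '(' word ')' run in one sentence."""
    hits = []
    for i in range(len(tokens) - 2):
        if tokens[i:i + 3] == ["(", word, ")"]:
            # Mid-sentence means there are tokens both before and after the run.
            mid_sentence = i > 0 and i + 3 < len(tokens)
            hits.append((i, mid_sentence))
    return hits

# Invented example sentence, not taken from the corpus.
sentence = ["Proszę", "kończyć", "(", "Dzwonek", ")", "panie", "pośle", "."]
print(find_parenthesised(sentence))
```

A hit flagged as mid-sentence is the complicated case mentioned above: turning it into an incident would mean placing it inside a sentence that has already been annotated.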

Moving this to milestone 3.1 in the hope that @mrudolf might fix this then. And that we can then get PL re-MTed...

@TomazErjavec TomazErjavec changed the title PL: many transciber comments not annotated PL: many transcriber comments not annotated Jun 8, 2023
@TomazErjavec
Collaborator

@mrudolf has not fixed this for 3.1, so moving to "future" milestone.

@mrudolf
Collaborator

mrudolf commented Sep 17, 2023

Alas, our proofreaders did not finish correcting that, so I haven't rerun the annotation yet. Will there be 3.2?

@TomazErjavec
Collaborator

Will there be 3.2?

Who knows... The project is ending now, so, unless there is somehow another, maybe not.

5 participants