PL: many transcriber comments not annotated #84

ajdapretnar · 2021-05-25T08:51:47Z

For the Polish data set, the top keywords include dzwonek (bell) and oklaski (applause). These should not be included in the top keywords, because they are audio notations, not an actual part of the speech.
Dzwonek also often gets tagged as a named entity.

matyaskopp · 2021-05-25T09:56:59Z

I probably don't understand the point of this issue...

PL corpus is correctly encoded:

ParlaMint/ParlaMint-PL/ParlaMint-PL_2018-09-27-senat-65-2.ana.xml

Lines 396 to 403 in 3c8ad8a

    
                 </linkGrp>
              </s>
           </seg>
           <kinesic type="applause">
              <desc>Oklaski</desc>
           </kinesic>
           <seg xml:id="seg747089">
              <s xml:id="seg747089.1">

In this case, kinesic is within the u (speech element) because it happens during the speech. Everything that has been said by a speaker is in the seg element.

So if you want to get the top keywords, you can just look into //u/seg elements. Or if it is exact words, you can look for w element in the annotated version of the corpus (or look for the //w/@lemma attribute for fusional languages).
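A minimal sketch of that suggestion, assuming the annotated files use the standard TEI namespace and that spoken tokens are `w` elements with a `lemma` attribute. The inline sample document and the helper name `spoken_lemmas` are made up for illustration; they are not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: ParlaMint .ana.xml files are in the TEI namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature of a speech: two segments with an applause note between them.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <u who="#speaker1">
    <seg><s>
      <w lemma="dziękować">Dziękuję</w>
      <w lemma="bardzo">bardzo</w>
      <pc>.</pc>
    </s></seg>
    <kinesic type="applause"><desc>Oklaski</desc></kinesic>
    <seg><s>
      <w lemma="dziękować">Dziękuję</w>
    </s></seg>
  </u>
</TEI>"""


def spoken_lemmas(root):
    """Return the lemmas of <w> tokens found under //u/seg only.

    Text inside <kinesic>/<desc> is not a <w> under <seg>, so it is skipped.
    """
    return [w.get("lemma") for w in root.findall(".//tei:u/tei:seg//tei:w", TEI)]


root = ET.fromstring(SAMPLE)
counts = Counter(spoken_lemmas(root))
print(counts.most_common())
```

Because the query only descends through `seg`, the "Oklaski" description never reaches the keyword counts, which is the behaviour described above.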

Do you have an example of incorrect encoding?

ajdapretnar · 2021-05-25T10:04:07Z

Perhaps NoSketch is not properly recognizing the tags then?

These are the outstanding words for the COVID period for regular MPs.

Then looking at the concordances of "dzwonek". I am not sure whether in this case "Dzwonek" are also people and places, but it looks incorrect (a Polish student confirmed this).

matyaskopp · 2021-05-25T10:44:50Z

OK, so it is not a problem with the data, but a problem with its representation in NoSketch.
I believe that it will be solved with #83. @TomazErjavec, am I right?

ajdapretnar · 2021-05-25T10:48:13Z

How about the named entities? The "PER:" and "LOC:" tags for "dzwonek"? Do they make sense in the original? Or might this be an issue?

matyaskopp · 2021-05-25T10:57:54Z

Oh, I had not studied the screenshot carefully. It is really weird; it should not happen!

Every named entity should contain at least one token.
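One way to check that constraint is sketched below. It assumes named entities are encoded as `name` elements in the TEI namespace and that tokens are `w` or `pc` elements; the sample fragment and the helper name `empty_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the annotated files are in the TEI namespace.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment: one well-formed entity and one entity without tokens.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <u><seg><s>
    <name type="PER"><w>Jan</w><w>Kowalski</w></name>
    <name type="LOC"/>
  </s></seg></u>
</TEI>"""


def empty_names(root):
    """Return <name> elements that contain neither a <w> nor a <pc> token."""
    return [
        name
        for name in root.iterfind(".//tei:name", TEI)
        if name.find(".//tei:w", TEI) is None
        and name.find(".//tei:pc", TEI) is None
    ]


root = ET.fromstring(SAMPLE)
bad = empty_names(root)
print([n.get("type") for n in bad])
```

Running this over a corpus file would list the entity types of any token-less `name` elements, which are the cases that should not exist.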

TomazErjavec · 2021-05-25T11:16:37Z

> OK, so it is not a problem with the data, but a problem with its representation in NoSketch. I believe that it will be solved with #83. @TomazErjavec, am I right?

Not quite: currently, the contents of incidents (represented in vert/noSkE as <note>) are indeed encoded in vert/noSkE as one token. However, these tokens are a) bracketed (e.g. "[Dzwonek]") and b) without annotations, i.e. they do not get lemma, pos, etc., and neither should they be included in <name>/NER tags. So all the examples of "Dzwonek" that @ajdapretnar has shown above in the concordances are part of the regular text. At the same time, some (well, most) do appear inside incidents, i.e. are correctly encoded.

These are the stats:

$ cat ParlaMint-PL.vert/*.vert | fgrep -c Dzwonek
17904
$ cat ParlaMint-PL.vert/*.vert | fgrep -c '[Dzwonek]'
15603

So, most are ok, but by no means all.
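The same comparison can be sketched in Python. This mirrors `fgrep -c`, which counts matching lines rather than individual occurrences; the sample lines and the column layout are invented for illustration, and only the two totals at the end come from the commands above.

```python
def count_lines_containing(lines, needle):
    """Count lines that contain `needle` as a plain substring (like `fgrep -c`)."""
    return sum(1 for line in lines if needle in line)


# Hypothetical vertical-format lines: bracketed incident tokens carry no
# annotation columns, while a regular token has tab-separated annotations.
SAMPLE_VERT = [
    "[Dzwonek]",
    "Dzwonek\tdzwonek\tNOUN",
    "[Dzwonek]",
    "Dziękuję\tdziękować\tVERB",
]

total = count_lines_containing(SAMPLE_VERT, "Dzwonek")
bracketed = count_lines_containing(SAMPLE_VERT, "[Dzwonek]")
unbracketed = total - bracketed  # "Dzwonek" lines that are not marked as incidents
print(total, bracketed, unbracketed)

# Applying the same subtraction to the totals reported above:
reported_total, reported_bracketed = 17904, 15603
reported_unbracketed = reported_total - reported_bracketed
share = round(100 * reported_unbracketed / reported_total, 1)
print(reported_unbracketed, share)
```

The difference between the two reported counts is the number of lines where "Dzwonek" appears without brackets, i.e. as part of the regular text.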

TomazErjavec · 2021-06-02T13:36:58Z

OK, the summary is that PL has 2301 cases of "Dzwonek" as part of the text, when, presumably (?), they should be encoded as incidents. This is about 13% of all occurrences of "Dzwonek", so not a negligible amount. Not sure why @matyaskopp removed the bug label, as this presumably is a bug. I guess the issue should remain open (even though its name is not really the best) in the hope that somebody fixes this in the fullness of time.

matyaskopp · 2021-06-02T13:59:53Z

I assumed that a "Dzwonek" is part of the speech (not just a note), but if it should be encoded as an incident, then it is a bug (placing the label back).

I was not able to trace it back to the original source data to see how it is encoded there, because the PL data do not precisely reference their source. (This should be an issue for the next releases: keeping back-references to the source.)

TomazErjavec · 2022-12-22T11:21:01Z

@mrudolf, don't forget to address this issue pls. And close or tell us to close when fixed.

mrudolf · 2023-01-25T19:10:38Z

Alas, pandemic sessions are apparently badly annotated by the Parliament, with many speakers missing.

We are now proof-reading all the sessions, but this will be finished in February. I hope it would be possible to update our corpora to the corrected version then.

TomazErjavec · 2023-06-08T18:26:09Z

This issue still persists in the (draft) 3.0: e.g. for the query "(, Dzwonek, )" there are 6,785 hits.
However, to be fair, in all the ones I looked at, "(Dzwonek)" appears in the middle of a sentence, so it is a bit complicated to do the correct annotations.

Moving this to milestone 3.1 in the hope that @mrudolf might fix this then. And that we can then get PL re-MTed...

TomazErjavec · 2023-09-17T18:36:20Z

@mrudolf has not fixed this for 3.1, so moving to "future" milestone.

mrudolf · 2023-09-17T18:38:05Z

Alas, our proofreaders did not finish correcting that so I haven't rerun the annotation yet. Will there be 3.2?

TomazErjavec · 2023-09-18T06:10:38Z

> Will there be 3.2?

Who knows... The project is ending now, so, unless there is somehow another, maybe not.

matyaskopp added the bug Something isn't working label May 25, 2021

matyaskopp mentioned this issue May 25, 2021

Schema: NER restriction #85

Open

matyaskopp removed the bug Something isn't working label May 25, 2021

TomazErjavec added this to the next milestone Jun 2, 2021

matyaskopp added the bug Something isn't working label Jun 2, 2021

matyaskopp assigned maciej-ogrodniczuk Feb 23, 2022

TomazErjavec changed the title ~~PL dataset: audio references should be excluded from analysis~~ PL: audio references should be excluded from analysis May 23, 2022

TomazErjavec modified the milestones: next, ParlaMint 3.0 release Nov 12, 2022

TomazErjavec assigned mrudolf Dec 22, 2022

TomazErjavec changed the title ~~PL: audio references should be excluded from analysis~~ PL: many transciber comments not annotated Jun 8, 2023

TomazErjavec modified the milestones: ParlaMint 3.0 release, ParlaMint 3.1 release Jun 8, 2023

TomazErjavec changed the title ~~PL: many transciber comments not annotated~~ PL: many transcriber comments not annotated Jun 8, 2023

TomazErjavec modified the milestones: ParlaMint 3.1 release, Future Sep 17, 2023
