PL: many transcriber comments not annotated #84
Comments
I probably don't understand the point of this issue... The PL corpus is correctly encoded: ParlaMint/ParlaMint-PL/ParlaMint-PL_2018-09-27-senat-65-2.ana.xml, lines 396 to 403 in 3c8ad8a.
In this case, [...] So if you want to get the top keywords, you can just look into [...] Do you have an example of incorrect encoding?
OK, so it is not a problem with the data, but with how the data is represented in NoSketch.
How about the named entities? The "PER:" and "LOC:" tags for "dzwonek"? Do they make sense in the original? Or might this be an issue?
Oh, I had not studied the screenshot carefully. It is really weird and should not happen! Every named entity should contain at least one token.
Not quite: currently, the contents of incidents (represented in vert/noSkE as [...] These are the stats: [...]
So, most are OK, but by no means all.
OK, the summary is that PL has 2301 cases of "Dzwonek" as part of the text, when, presumably (?), they should be encoded as incidents. This is about 13% of all occurrences of "Dzwonek", so not a negligible amount. Not sure why @matyaskopp removed the bug label, as this presumably is a bug. I guess the issue should remain open (even though its name is not really the best) in the hope that somebody fixes this in the fullness of time.
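For anyone who wants to reproduce this kind of count on a single file, here is a minimal sketch. It assumes tokens are `<w>` elements and that transcriber comments are `<incident>`, `<kinesic>`, `<vocal>` or `<note>` elements in the TEI namespace; the sample XML and the `count_word` helper are made up for illustration and are not taken from the PL corpus.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample, NOT real corpus data: one "Dzwonek" tokenised as
# speech text and one encoded as an incident.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#speakerA">
      <seg>
        <s><w>Dziękuję</w><pc>.</pc></s>
        <s><pc>(</pc><w>Dzwonek</w><pc>)</pc></s>
      </seg>
      <incident><desc>Dzwonek</desc></incident>
      <kinesic><desc>Oklaski</desc></kinesic>
    </u>
  </div></body></text>
</TEI>"""

# Element names assumed to hold transcriber comments.
COMMENT_TAGS = ("incident", "kinesic", "vocal", "note")


def count_word(root, word):
    """Return (in_text, in_comments) counts for `word`, case-insensitively.

    in_text counts <w> tokens equal to the word; in_comments counts
    comment elements whose text contains it.
    """
    target = word.lower()
    in_text = sum(
        1
        for w in root.iter(f"{{{TEI_NS}}}w")
        if (w.text or "").strip().lower() == target
    )
    in_comments = 0
    for tag in COMMENT_TAGS:
        for el in root.iter(f"{{{TEI_NS}}}{tag}"):
            if target in "".join(el.itertext()).lower():
                in_comments += 1
    return in_text, in_comments


root = ET.fromstring(SAMPLE)
in_text, in_comments = count_word(root, "Dzwonek")
print(f"as text: {in_text}, as comment: {in_comments}")
```

Note that a comment element nested inside another one would be counted twice by this sketch, so it is only a rough check.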
I assumed that a "Dzwonek" is part of the speech (not just a note), but if it should be encoded as an incident, then it is a bug (placing the label back). I was not able to trace it back to the original source data to see how it is encoded there, because the PL data do not precisely reference the source. (This should be an issue for the next releases: keeping back-references to the source.)
@mrudolf, don't forget to address this issue pls. And close it, or tell us to close it, when fixed.
Alas, pandemic sessions are apparently badly annotated by the Parliament, with many speakers missing. We are now proofreading all the sessions, but this will only be finished in February. I hope it will be possible to update our corpora to the corrected version then.
This issue still persists in the (draft) 3.0: e.g. for the query "(, Dzwonek, )" there are 6,785 hits. Moving this to milestone 3.1 in the hope that @mrudolf might fix this then. And that we can then get PL re-MTed...
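Outside the concordancer, the same three-token pattern can be counted on any tokenised text. The sketch below assumes the tokens are already available as a plain Python list; `count_sequence` and the token list are hypothetical and not part of any ParlaMint tooling.

```python
def count_sequence(tokens, pattern):
    """Count how often `pattern` occurs as consecutive tokens in `tokens`."""
    n = len(pattern)
    if n == 0:
        return 0  # an empty pattern is treated as having no hits
    return sum(
        1
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == pattern
    )


# Made-up token stream for illustration only.
tokens = [
    "Dziękuję", ".", "(", "Dzwonek", ")",
    "Proszę", "(", "Oklaski", ")",
    "(", "Dzwonek", ")",
]

hits = count_sequence(tokens, ["(", "Dzwonek", ")"])
print(hits)
```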
@mrudolf has not fixed this for 3.1, so moving to the "future" milestone.
Alas, our proofreaders did not finish correcting that, so I haven't rerun the annotation yet. Will there be a 3.2?
Who knows... The project is ending now, so, unless there is somehow another one, maybe not.
For the Polish data set, if looking at top keywords, one gets dzwonek (bell) and oklaski (applause). These should not be included in the top keywords, because they are audio notations, not an actual part of the speech.
Dzwonek also often gets tagged as a named entity.