BE feedback #496

Open
9 of 10 tasks
matyaskopp opened this issue Dec 1, 2022 · 14 comments · Fixed by #476
@matyaskopp
Collaborator

matyaskopp commented Dec 1, 2022

I have just a few observations:

Responsibility for lingv. annotations in TEI version

  • remove linguistic annotation responsibility from TEI version

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L17-L21

                <respStmt>
                    <persName>Jesse de Does</persName>
                    <resp xml:lang="nl">Taalkundige verrijking</resp>
                    <resp xml:lang="en">Linguistic annotation</resp>
                </respStmt>

Wrong date

  • fix date in text

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L70

<date from="2015-11-12" to="2022-07-13">2015-11-12 - 022-07-13</date>
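A mismatch like this between the attributes and the element text could be caught with a small check. The following is only a sketch, not part of the ParlaMint tooling; the function name `date_text_matches` and the assumption that the text holds exactly the two ISO dates are mine.

```python
import re
import xml.etree.ElementTree as ET

# Matches a full ISO date such as 2015-11-12 (four-digit year required).
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def date_text_matches(date_xml: str) -> bool:
    """Return True if the text of a <date from=".." to=".."> element
    contains exactly the two ISO dates given in its attributes."""
    elem = ET.fromstring(date_xml)
    expected = [elem.get("from"), elem.get("to")]
    found = ISO_DATE.findall(elem.text or "")
    return found == expected
```

Run on the element above, the truncated year `022` does not match the pattern, so the check fails.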

Taxonomy fusion

  • use common taxonomies without modification, just add translations

You have invented some new taxonomies, and some common ones are modified. This needs to be unified in v3.1.
E.g., you used new categories in parla.legislature:

<category xml:id="parla.federal">
  <!-- toegevoegd -->
  <catDesc xml:lang="nl">
    <term>Federaal</term>
  </catDesc>
  <catDesc xml:lang="en">
    <term>Federal</term>
  </catDesc>
</category>

You can check the CZ folder to see how common taxonomies should look.

wrong idno type

  • idno type

please follow the recommendation here: https://clarin-eric.github.io/ParlaMint/#TEI.idno

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L408

<idno type="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>
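If many such elements need the same change, it could be scripted. This is a minimal sketch that only handles the `wikimedia` case shown above; the function name `normalize_idno` is hypothetical and other idno types would need their own rules.

```python
import xml.etree.ElementTree as ET


def normalize_idno(elem: ET.Element) -> None:
    """Rewrite <idno type="wikimedia"> as <idno type="URI" subtype="wikimedia"> in place."""
    # Strip a namespace prefix such as {http://www.tei-c.org/ns/1.0} if present.
    local_name = elem.tag.rsplit("}", 1)[-1]
    if local_name == "idno" and elem.get("type") == "wikimedia":
        elem.set("subtype", "wikimedia")
        elem.set("type", "URI")
```

Elements with any other tag or type are left untouched.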

settingDesc date in corpus root files

  • settingDesc date
  • remove ana="#parla.sitting" from corpus root files

The date should cover the full corpus period:
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L392

            <settingDesc>
                <setting>
                    <name type="city">Brussel</name>
                    <name key="BE" type="country">België</name>
<!-- MISSING from and to -->
                    <date ana="#parla.sitting" when="2016-05-26">2016-05-26</date>
                </setting>
            </settingDesc>

speaker note before speech

  • missing annotation type="speaker"
  • move before speech

It is common to have a speaker note before a speech - it is not a part of the speech.
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L108

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
    <note>01.01 Nabil Boukili (PVDA-PTB):</note>

should be

<note type="speaker">01.01 Nabil Boukili (PVDA-PTB):</note>
<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
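The move could be done as a post-processing step. The sketch below assumes that the speaker note is always the first child of `<u>` with no text before it; the function name `hoist_speaker_notes` is hypothetical and the real conversion pipeline may work differently.

```python
import xml.etree.ElementTree as ET


def local_name(elem: ET.Element) -> str:
    """Tag name without a namespace prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def hoist_speaker_notes(parent: ET.Element) -> None:
    """Move a leading <note> out of each child <u> and place it
    directly before that <u>, marked as type="speaker"."""
    for u in [child for child in parent if local_name(child) == "u"]:
        children = list(u)
        if not children or local_name(children[0]) != "note":
            continue
        if (u.text or "").strip():
            continue  # there is spoken text before the note, leave it alone
        note = children[0]
        u.remove(note)
        note.set("type", "speaker")
        parent.insert(list(parent).index(u), note)
```

It is called on the element that contains the `<u>` elements, for example the enclosing `<div>`.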

missing parts of transcriptions

  • missing content?

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L494-L497

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u35" who="#VanVaerenberghKristien" xml:lang="nl">
  <note xml:lang="nl">07.02 Kristien Van Vaerenbergh (N-VA):</note>
  <seg xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.seg286" xml:lang="nl">Reeds enige tijd is er een groot aantal vacante plaatsen voor de functie van vrederechter op de Brusselse vredegerechten. In uw beleidsverklaring sprak u van extra investeringen in Justitie onder andere op gebied van informatica en het aanwerven van meer personeel.</seg>
</u>


missing notes

  • missing notes

There are a lot of notes like this:

Het incident is gesloten.
L'incident est clos.
De openbare commissievergadering wordt gesloten om 17.19 uur.
La réunion publique de commission est levée à 17 h 19.

These are missing from the component files.

@matyaskopp matyaskopp linked a pull request Dec 19, 2022 that will close this issue
@JessedeDoes
Collaborator

First the easy ones:

  • We fixed the validation issue found by Tomaz in one of the files
  • We removed the resp statement for linguistic annotation from the annotated files
  • Wrong dates are corrected (also in settingDesc)
  • idno type is corrected
  • speaker note is moved to be before speech

Cf the next comments for the more complex issues.

@JessedeDoes
Collaborator

JessedeDoes commented Jan 5, 2023

  • missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as <p>. The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make files parlamint-conformant. In the current version, we keep this content as gaps:
                <gap reason="editorial">
                    <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
                </gap>

I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

@JessedeDoes
Copy link
Collaborator

JessedeDoes commented Jan 5, 2023

  • Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology.
BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

  • In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml
  • The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml
  • The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?
    • The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like
      <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/>
      could become something like
      <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/>
      to distinguish common base layer and extension to make the encoding more interoperable?

@matyaskopp
Collaborator Author

  • missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as <p>. The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make files parlamint-conformant. In the current version, we keep this content as gaps:
                <gap reason="editorial">
                    <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
                </gap>

I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

Does this mean that you are unsure if it is an utterance <u> or stenographer's notes <note>? I believe this should be a note if you are not sure.

  • Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

yes, taking CZ taxonomy is ok. But for UD-SYN taxonomy, it is better to use this: https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml
This taxonomy is automatically generated from UD documentation and contains all documented relations (even for languages that are not in ParlaMint)

ParlaMint/Makefile

Lines 589 to 594 in 5deaeed

##!create-UD-SYN-taxonomy##
create-taxonomy-UD-SYN:
	test -d Scripts/UD-docs || git clone git@github.com:UniversalDependencies/docs.git Scripts/UD-docs
	git -C Scripts/UD-docs checkout pages-source
	git -C Scripts/UD-docs pull
	Scripts/create-taxonomy-UD-SYN.pl --in Scripts/UD-docs --out ParlaMint-taxonomy-UD-SYN.ana.xml

  • In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy.
@TomazErjavec, do we agree on that?

As for the parla.federal category, I think that it should be parla.national, and the separate parliaments within the federation should be parla.regional, so you don't need a parla.federal category.

  • The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml

Yes, the taxonomy is limited, but it is as it is defined in ParlaMint.

  • regular = members of parliament and government
  • chair = chair of the meeting/sitting
  • guest = anyone else

If you want to extend this taxonomy, I guess you should create a new one as you did. But if a minister is speaking, then you should use both taxonomies (I hope this will not break @TomazErjavec's script):

<u ana="#regular #minister" ...>

But remember that this categorization is speaker categorization, so if someone holds a minister position, it does not necessarily mean that he is speaking as a minister (not a regular MP) - in CZ, we are not able to distinguish this from the transcription.

The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?

  • The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like
    <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/>
    could become something like
    <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/>
    to distinguish common base layer and extension to make the encoding more interoperable?

The https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml taxonomy covers these situations:

<category xml:id="obl_mod">
<!-- languages: bej ess fr pcm -->
<catDesc xml:lang="en"><term>obl:mod</term>: oblique modifier</catDesc>
</category>

@TomazErjavec
Collaborator

I believe this should be a note if you are not sure.

I agree, <note type="editorial"> is better than <gap reason="editorial">.

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

I don't think so, if we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories in the common taxonomy, as long as they are nicely positioned in it. Didn't have a look yet at your taxonomy though. @matyaskopp, do you see a problem here?

@matyaskopp
Collaborator Author

I don't think so, if we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories in the common taxonomy, as long as they are nicely positioned in it. Didn't have a look yet at your taxonomy though. @matyaskopp, do you see a problem here?

I can imagine that we can extend parla.legislature taxonomy. But in fact, I think new BE categories are breaking it a bit. The taxonomy describes multiple points of view:

parla.meeting.committee

So if we have the category parla.meeting.committee, it is a mixture of temporal and organization type. This should instead be two categories: parla.meeting (temporal) and parla.committee (organization type).

I do not see a reason for adding parla.meeting.committee because it is a kind of hybrid category.

parla.community

We don't have "Flemish or Wallonian community" in CZ. What is this category for?

parla.federal

I think this can be replaced with parla.national, or we can add this category between parla.supranational and parla.national. I think it is more like a "province" point of view (not organization type).

@JessedeDoes
Collaborator

Multiple speaker types indeed break the validation:

 Error: Type error on line 332 column 49 of parlamint-lib.xsl:
    XTTE0780  A sequence of more than one item is not allowed as the result of a call to
    et:u-role#1 ("Prime Minister", "Regular") 

We interpreted 'regular' as "speaking as member of parliament". If a person holds a minister post at the time of speaking, he/she is not speaking as a member of parliament.

But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?
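The proposed mapping could look roughly like the sketch below. The BE role labels in `REGULAR_ROLES` and the `is_chair` flag are hypothetical; only the three target values (regular, chair, guest) come from the ParlaMint speaker types listed earlier in this thread.

```python
# Hypothetical BE role labels that would count as "regular" speakers.
REGULAR_ROLES = {
    "member of parliament",
    "minister",
    "prime minister",
    "secretary of state",
}


def parlamint_speaker_type(be_role: str, is_chair: bool = False) -> str:
    """Map a BE speaker role to exactly one ParlaMint speaker type."""
    if is_chair:
        return "#chair"
    if be_role.strip().lower() in REGULAR_ROLES:
        return "#regular"
    return "#guest"
```

Because the function returns a single value, each `<u>` gets one speaker type in `@ana`, which avoids the multi-value error shown above.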

@JessedeDoes
Collaborator

JessedeDoes commented Jan 6, 2023

Summarizing:

  • Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?
  • We removed some unnecessary information from the taxonomies
  • Indeed the other UD relation declaration file contains all we need
  • The samples now pass the github validation

@matyaskopp
Collaborator Author

But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?

Yes, that will probably be best. We need all corpora to be comparable... Thanks.

@TomazErjavec
Collaborator

I agree with @matyaskopp, all speakers are regular speakers (like MPs, ministers, prime minister), except invited guests, who are not affiliated with the parliament or government. Adding "#minister" would be redundant anyway: we know somebody is a minister from their affiliation, by resolving the affiliation's from and to dates against when the person is speaking.

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?

I think @type="editorial" covers this anyway to an extent (why would the editor put something into a note unless it was problematic). It would not be much work to add @subtype to note, but I am a bit disinclined to do so, if only BE would be using it (while others have similar cases, which they treat as note/@type="editorial").

As for the taxonomy, I would need to find some quality time to understand the whole thing, which I can't seem to find, sigh. Maybe the weekend...

@matyaskopp
Collaborator Author

matyaskopp commented Jan 9, 2023

invalid url format

  • fix urls

https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE.xml#L70-L78

                    <idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>
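Since a valid URL contains no literal whitespace, the line breaks and indentation introduced by the reformatter can be removed safely. A minimal sketch (the function name `collapse_url` is hypothetical), applied to the element text after XML parsing, so `&amp;` is already `&`:

```python
def collapse_url(text: str) -> str:
    """Remove all whitespace that a pretty-printer introduced inside a URL."""
    return "".join(text.split())
```

For the idno above this gives back a single-line URL with the query parameters joined by `&`.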

@matyaskopp
Collaborator Author

speeches misclassification

  • speeches misclassification

I still don't understand why so many speeches are misclassified. From my point of view (without language knowledge), HTML classes, elements, and other attributes can be used.

Describing this: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-03-30-definitief-55-commissie-ic427x.xml#L174-L177
which corresponds to this place in the source: https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#TN01

Regular/guest speeches start with a <p> with one of these classes: italFR, NormalNL, NormalFR (and probably italNL). Inside these <p>, there are:

  • (optionally) <a name="TN01"></a>, where TN01 is the speech number (you can use this anchor in the @source attribute - see CZ <u> elements)
  • one or two <span class="oraspr">..., which contain the number of the speech ({topic}.{speech in topic}) and the speaker name
  • followed by (party): in the following span

It looks like a speech ends when

  • a new speech starts
  • or the topic changes (https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#T016), with this sequence of elements:
<p class=italNL><a name=T016></a><span lang=NL>Het incident is gesloten.</span></p>
<p class=italFR><span lang=FR-BE>L'incident est clos.</span></p>
<p class=MsoNormal>...</p>
<p class=Titre2NL>...

So only the beginning of the meeting and a new topic before the first speech can contain unclassified notes, or you can classify them as <note type="comment">...</note>
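The speech-start rule described above can be sketched with the standard-library HTML parser. This is only an illustration under the assumptions stated in this comment (paragraph classes italFR, italNL, NormalFR, NormalNL and a `span` with class `oraspr`); the class name `SpeechStartFinder` is hypothetical and the real pages may contain cases it does not cover.

```python
from html.parser import HTMLParser

# Paragraph classes that may open a regular/guest speech (from the description above).
SPEECH_CLASSES = {"italFR", "italNL", "NormalFR", "NormalNL"}


class SpeechStartFinder(HTMLParser):
    """Collect the 'oraspr' label (speech number and speaker name) of every
    <p> that looks like the start of a speech."""

    def __init__(self):
        super().__init__()
        self.in_candidate = False  # inside a <p> with one of SPEECH_CLASSES
        self.span_stack = []       # True for each open <span class="oraspr">
        self.label = []            # text collected from oraspr spans
        self.starts = []           # one label per detected speech start

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.in_candidate = attrs.get("class") in SPEECH_CLASSES
            self.span_stack = []
            self.label = []
        elif tag == "span" and self.in_candidate:
            self.span_stack.append(attrs.get("class") == "oraspr")

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()
        elif tag == "p":
            if self.in_candidate and self.label:
                # Normalize internal whitespace of the collected label.
                self.starts.append(" ".join("".join(self.label).split()))
            self.in_candidate = False

    def handle_data(self, data):
        if self.in_candidate and any(self.span_stack):
            self.label.append(data)
```

After `finder = SpeechStartFinder(); finder.feed(html)`, `finder.starts` lists the labels of the paragraphs that open a speech; paragraphs such as "Het incident is gesloten." have no `oraspr` span and are not reported.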

There are also chairman speeches that do not follow the rules above, but you have correctly identified them.

notes do not contain xml:lang

  • xml:lang in <note>

  • this is available in source HTML

strange xml directives

  • remove directives inside xml document <? ?>

if you want to use <note type="editorial"> then please remove <? ?> which is strange:
https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-06-03-definitief-55-plenair-ip107x.xml#L401-L403

                <note type="editorial">
                    <?uncertain_content_classification Could be possibly be speaker text?>
                    L'incident est clos.
                </note>

But I prefer not to use it, at least in the case above, that can be encoded better:

<note type="comment" xml:lang="fr">L'incident est clos.</note>

or:

<note type="narrative" xml:lang="fr">L'incident est clos.</note>

@matyaskopp matyaskopp mentioned this issue Jan 9, 2023
@JessedeDoes
Collaborator

JessedeDoes commented Jan 9, 2023

  • Fixing the URL (strange effect of automatic script reformatting in intellij) will be easy
  • The xml:lang was present on the <p> elements, so echoing it on the notes is not a problem
  • The idea with the processing instruction was to mark these cases as a todo for further processing.
  • Yes, surely the classification of content can be improved. Currently, we do not have any developer with time to work on this refinement; we would prefer to postpone this to a later stage when we will revisit the whole pipeline in order to minimize the amount of manual supervision, so it can run continuously on new available data instead of the current bursty approach

@matyaskopp matyaskopp linked a pull request Feb 1, 2023 that will close this issue
@TomazErjavec
Collaborator

@JessedeDoes, in 77e8d95 I've added parla.meeting.committee to the general taxonomy. I'm not absolutely sure if the category belongs where I put it, but it might be good enough for now. So, could you copy the new category into your general ParlaMint-taxonomy-parla.legislature taxonomy and remove your additional taxonomy, please?

<category xml:id="parla.meeting.committee">
<catDesc xml:lang="sl">
<term>Seja delovnega telesa</term>
</catDesc>
<catDesc xml:lang="en">
<term>Committee meeting</term>
</catDesc>
</category>
</category>

@TomazErjavec TomazErjavec added this to the Future milestone Mar 2, 2024