BE feedback #496

matyaskopp · 2022-12-01T06:45:05Z

I have just a few observations:

Responsibility for lingv. annotations in TEI version

remove linguistic annotation responsibility from TEI version

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L17-L21

                <respStmt>
                    <persName>Jesse de Does</persName>
                    <resp xml:lang="nl">Taalkundige verrijking</resp>
                    <resp xml:lang="en">Linguistic annotation</resp>
                </respStmt>

Wrong date

fix date in text

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L70

<date from="2015-11-12" to="2022-07-13">2015-11-12 - 022-07-13</date>

Taxonomy fusion

use common taxonomies without modification, just add translations

You have introduced some new taxonomies, and some of the common ones are modified. This needs to be unified in v3.1.
E.g., you used new categories in parla.legislature:

<category xml:id="parla.federal">
  <!-- toegevoegd -->
  <catDesc xml:lang="nl">
    <term>Federaal</term>
  </catDesc>
  <catDesc xml:lang="en">
    <term>Federal</term>
  </catDesc>
</category>

You can check CZ folder for how common taxonomies should look.

wrong idno type

idno type

please follow the recommendation here: https://clarin-eric.github.io/ParlaMint/#TEI.idno

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L408

<idno type="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>
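
The rewrite above could be automated roughly as follows. This is only a minimal sketch, not the ParlaMint tooling: the function name `normalize_idno` is hypothetical, and it assumes that every `idno` whose value is an http(s) URL should get `type="URI"` with the old type moved to `subtype`. Real corpus files use the TEI namespace, which the sample omits for brevity.

```python
import xml.etree.ElementTree as ET


def normalize_idno(idno):
    """Move a non-URI type (e.g. "wikimedia") to @subtype and set @type to "URI".

    Assumption: only idno elements whose text is an http(s) URL are rewritten,
    and an existing @subtype is never overwritten.
    """
    old_type = idno.get("type")
    value = (idno.text or "").strip()
    is_url = value.startswith(("http://", "https://"))
    if old_type not in (None, "URI") and is_url and idno.get("subtype") is None:
        idno.set("subtype", old_type)
        idno.set("type", "URI")


sample = (
    '<idno type="wikimedia" xml:lang="nl">'
    "https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>"
)
idno = ET.fromstring(sample)
normalize_idno(idno)
print(ET.tostring(idno, encoding="unicode"))
```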

settingDesc date in corpus root files

settingDesc date
remove ana="#parla.sitting" from corpus root files

The date should contain full corpus period
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L392

            <settingDesc>
                <setting>
                    <name type="city">Brussel</name>
                    <name key="BE" type="country">België</name>
<!-- MISSING from and to -->
                    <date ana="#parla.sitting" when="2016-05-26">2016-05-26</date>
                </setting>
            </settingDesc>

speaker note before speech

missing annotation type="speaker"
move before speech

It is common to have a speaker note before a speech - it is not a part of the speech.
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L108

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
    <note>01.01 Nabil Boukili (PVDA-PTB):</note>

should be

<note type="speaker">01.01 Nabil Boukili (PVDA-PTB):</note>
<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
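
A possible way to do this move in a conversion script is sketched below. It is an illustration under assumptions, not the BE pipeline: `hoist_speaker_notes` is a hypothetical helper, the sample omits the TEI namespace and uses placeholder segment text, and only a `<note>` that is the very first child of `<u>` is treated as a speaker note.

```python
import xml.etree.ElementTree as ET


def hoist_speaker_notes(parent):
    """Move a leading <note> out of each <u> child and mark it type="speaker"."""
    for u in list(parent.findall("u")):
        if len(u) == 0 or u[0].tag != "note":
            continue
        if (u.text or "").strip():
            # Text precedes the note, so it is not a leading speaker note.
            continue
        note = u[0]
        u.remove(note)
        note.set("type", "speaker")
        # Insert the note as a sibling directly before its utterance.
        parent.insert(list(parent).index(u), note)


sample = (
    "<div>"
    '<u ana="#regular" who="#BoukiliNabil" xml:lang="nl">'
    "<note>01.01 Nabil Boukili (PVDA-PTB):</note>"
    "<seg>Tekst</seg>"
    "</u>"
    "</div>"
)
root = ET.fromstring(sample)
hoist_speaker_notes(root)
print(ET.tostring(root, encoding="unicode"))
```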

missing parts of transcriptions

missing content?

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L494-L497

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u35" who="#VanVaerenberghKristien" xml:lang="nl">
  <note xml:lang="nl">07.02 Kristien Van Vaerenbergh (N-VA):</note>
  <seg xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.seg286" xml:lang="nl">Reeds enige tijd is er een groot aantal vacante plaatsen voor de functie van vrederechter op de Brusselse vredegerechten. In uw beleidsverklaring sprak u van extra investeringen in Justitie onder andere op gebied van informatica en het aanwerven van meer personeel.</seg>
</u>

missing notes

missing notes

There are a lot of notes like this:

Het incident is gesloten.
L'incident est clos.
De openbare commissievergadering wordt gesloten om 17.19 uur.
La réunion publique de commission est levée à 17 h 19.

which are missing in the component files.

JessedeDoes · 2023-01-05T09:40:01Z

First the easy ones:

We fixed the validation issue found by Tomaz in one of the files
We removed the resp statement for linguistic annotation from the annotated files
Wrong dates are corrected (also in settingDesc)
idno type is corrected
speaker note is moved to be before speech

Cf the next comments for the more complex issues.

JessedeDoes · 2023-01-05T09:40:10Z

missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as . The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make files parlamint-conformant. In the current version, we keep this content as gaps:

                <gap reason="editorial">
                    <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
                </gap>

I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

JessedeDoes · 2023-01-05T10:24:58Z

Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology.
BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml
The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml
The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?
- The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like
 <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/>
 could become something like
 <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/>
 to distinguish common base layer and extension to make the encoding more interoperable?

matyaskopp · 2023-01-05T14:29:35Z

missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as . The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make files parlamint-conformant. In the current version, we keep this content as gaps:
 <gap reason="editorial">
 <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
 </gap>
I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

Does this mean that you are unsure whether it is an utterance <u> or a stenographer's note <note>? I believe it should be a note if you are not sure.

Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

Yes, taking the CZ taxonomy is OK. But for the UD-SYN taxonomy, it is better to use this one: https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml
This taxonomy is automatically generated from the UD documentation and contains all documented relations (even for languages that are not in ParlaMint).

ParlaMint/Makefile

Lines 589 to 594 in 5deaeed

	##!create-UD-SYN-taxonomy##
	create-taxonomy-UD-SYN:
	test -d Scripts/UD-docs || git clone git@github.com:UniversalDependencies/docs.git Scripts/UD-docs
	git -C Scripts/UD-docs checkout pages-source
	git -C Scripts/UD-docs pull
	Scripts/create-taxonomy-UD-SYN.pl --in Scripts/UD-docs --out ParlaMint-taxonomy-UD-SYN.ana.xml

In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy.
@TomazErjavec, do we agree on that?

As for the parla.federal category, I think that it should be parla.national and separate parliaments in federation should be parla.regional, so you don't need a parla.federal category.

The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml

Yes, the taxonomy is limited, but it is as it is defined in ParlaMint.

regular = members of parliament and government
chair = chair of the meeting/sitting
guest = any other speaker

If you want to extend this taxonomy, I guess you should create a new one as you did. But if a minister is speaking, then you should use both taxonomies. (I hope this will not break @TomazErjavec's script):

<u ana="#regular #minister" ...>

But remember that this categorization is speaker categorization, so if someone holds a minister position, it does not necessarily mean that he is speaking as a minister (not a regular MP) - in CZ, we are not able to distinguish this from the transcription.

The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?

The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like
<link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/>
could become something like
<link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/>
to distinguish common base layer and extension to make the encoding more interoperable?

The taxonomy at https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml covers these situations:

ParlaMint/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml

Lines 1210 to 1213 in 5deaeed

    
           <category xml:id="obl_mod">
              <!-- languages: bej ess fr pcm -->
              <catDesc xml:lang="en"><term>obl:mod</term>: oblique modifier</catDesc>
           </category>

if not
- missing documentation - should be reported here: https://github.com/UniversalDependencies/docs/issues
- bug in annotation tool (/training data) - provides relation that does not exist -> should be replaced with universal relation dep
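
As a rough illustration of the second case, a converter could check each relation against the common taxonomy and fall back to `dep` when it is not declared. This is a sketch under assumptions: the helper names `known_relations` and `resolve_relation` are hypothetical, the inline taxonomy is a trimmed stand-in for the real file (which uses the TEI namespace), and the `dep` entry shown here is written from the UD definition rather than copied from the generated taxonomy.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed, namespace-free stand-in for ParlaMint-taxonomy-UD-SYN.ana.xml.
TAXONOMY = """<taxonomy>
  <category xml:id="dep">
    <catDesc xml:lang="en"><term>dep</term>: unspecified dependency</catDesc>
  </category>
  <category xml:id="obl_mod">
    <!-- languages: bej ess fr pcm -->
    <catDesc xml:lang="en"><term>obl:mod</term>: oblique modifier</catDesc>
  </category>
</taxonomy>"""


def known_relations(taxonomy_xml):
    """Collect the xml:id of every <category> in the taxonomy."""
    root = ET.fromstring(taxonomy_xml)
    return {c.get(XML_ID) for c in root.iter("category") if c.get(XML_ID)}


def resolve_relation(ana, known):
    """Keep a declared relation; replace an undeclared one with <prefix>:dep."""
    prefix, _, relation = ana.partition(":")
    return ana if relation in known else f"{prefix}:dep"


known = known_relations(TAXONOMY)
print(resolve_relation("ud-syn:obl_mod", known))
print(resolve_relation("ud-syn:not_a_relation", known))
```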

TomazErjavec · 2023-01-05T14:40:09Z

I believe this should be a note if you are not sure.

I agree, <note type="editorial"> is better than <gap type="editorial"> .

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

matyaskopp · 2023-01-05T15:12:15Z

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

I can imagine that we can extend parla.legislature taxonomy. But in fact, I think new BE categories are breaking it a bit. The taxonomy describes multiple points of view:

province

ParlaMint/Data/ParlaMint-CZ/ParlaMint-taxonomy-parla.legislature.xml

Line 11 in 5deaeed

<category xml:id="parla.geo-political">

organization type

ParlaMint/Data/ParlaMint-CZ/ParlaMint-taxonomy-parla.legislature.xml

Line 51 in 5deaeed

<category xml:id="parla.organization">

temporal

ParlaMint/Data/ParlaMint-CZ/ParlaMint-taxonomy-parla.legislature.xml

Line 147 in 5deaeed

<category xml:id="parla.term">

`parla.meeting.committee`

So a category parla.meeting.committee is a mixture of the temporal and organization type points of view. It should be expressed with two categories instead: parla.meeting (temporal) and parla.committee (organization type).

I do not see a reason for adding parla.meeting.committee because it is a kind of hybrid category.

`parla.community`

We don't have "Flemish or Wallonian community" in CZ. What is this category for?

`parla.federal`

I think this can be replaced with parla.national, or we can add this category between parla.supranational and parla.national. I think it belongs more to the "province" point of view (not organization type).

JessedeDoes · 2023-01-06T08:50:50Z

Multiple speaker types indeed break the validation:

 Error: Type error on line 332 column 49 of parlamint-lib.xsl:
    XTTE0780  A sequence of more than one item is not allowed as the result of a call to
    et:u-role#1 ("Prime Minister", "Regular")

We interpreted 'regular' as "speaking as member of parliament". If a person holds a minister post at the time of speaking, he/she is not speaking as a member of parliament.

But I can map our current speaker types to ParlaMint, assuming that parliament members, ministers, prime ministers and secretaries of state are "regulars", and the rest are "guests" (incidental speakers)?

JessedeDoes · 2023-01-06T09:29:16Z

Summarizing:

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?
We removed some unnecessary information from the taxonomies
Indeed the other UD relation declaration file contains all we need
The samples now pass the github validation

matyaskopp · 2023-01-06T09:31:40Z

But I can map our current speaker types to ParlaMint, assuming that parliament members, ministers, prime ministers and secretaries of state are "regulars", and the rest are "guests" (incidental speakers)?

Yes, that will probably be best. We need all corpora to be comparable... Thanks
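
The agreed mapping could look roughly like the sketch below. The BE-side type names (the dictionary keys) are hypothetical placeholders, since the actual identifiers in ParlaMint-BE-taxonomy-speaker_types.xml are not quoted in this thread; only the three ParlaMint values (regular, chair, guest) come from the discussion. Each speaker gets exactly one value, which avoids the multi-value validation error reported above.

```python
# Keys are hypothetical BE speaker types; values are the ParlaMint categories.
BE_TO_PARLAMINT = {
    "member_of_parliament": "regular",
    "minister": "regular",
    "prime_minister": "regular",
    "secretary_of_state": "regular",
    "chair": "chair",
}


def parlamint_role(be_type):
    """Return a single @ana pointer; anything unmapped counts as a guest."""
    return "#" + BE_TO_PARLAMINT.get(be_type, "guest")


for be_type in ("prime_minister", "chair", "invited_expert"):
    print(be_type, "->", parlamint_role(be_type))
```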

TomazErjavec · 2023-01-06T09:50:49Z

I agree with @matyaskopp, all speakers are regular speakers (like MPs, ministers, prime minister), except invited guests, who are not affiliated with the parliament or government. Adding "#minister" would be redundant anyway: we know somebody is a minister from their affiliation, by checking the affiliation's from and to dates against when the person is speaking.

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?

I think @type="editorial" covers this anyway to an extent (why would the editor put something into a note unless it was problematic). It would not be much work to add @subtype to note, but I am a bit disinclined to do so, if only BE would be using it (while others have similar cases, which they treat as note/@type="editorial").

As for the taxonomy, I would need to find some quality time to understand the whole thing, which I can't seem to find, sigh. Maybe the weekend...

matyaskopp · 2023-01-09T14:46:08Z

invalid url format

fix urls

https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE.xml#L70-L78

                    <idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>
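
A minimal repair for this kind of reformatting damage is to strip all whitespace from the URL text. The sketch below assumes that a valid URL in `idno` never contains literal whitespace; `collapse_url` is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET


def collapse_url(text):
    """Remove every whitespace run that a reformatter injected into a URL."""
    return re.sub(r"\s+", "", text)


broken = """<idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>"""

idno = ET.fromstring(broken)
idno.text = collapse_url(idno.text)
# Serializing escapes "&" as "&amp;" again, so the output stays well-formed XML.
print(ET.tostring(idno, encoding="unicode"))
```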

matyaskopp · 2023-01-09T15:58:54Z

speeches misclassification

speeches misclassification

I still don't understand why so many speeches are misclassified. From my point of view (without knowledge of the language), the HTML classes, elements and other attributes can be used to classify them.

Describing this: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-03-30-definitief-55-commissie-ic427x.xml#L174-L177
which corresponds to this place in the source: https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#TN01

Regular/guest speeches start with a <p> with one of these classes: italFR, NormalNL, NormalFR (and probably italNL). Inside these <p> elements, there are:

(optionally) <a name="TN01"></a>, where TN01 is the speech number (you can use this anchor in the @source attribute - see CZ  elements)
one or two ... which contain the speech number ({topic}.{speech in topic}) and the speaker name
followed by (party): in the following span
It looks like a speech ends when
a new speech starts
or the topic changes (https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#T016), which is marked by this sequence of elements:

<p class=italNL><a name=T016></a><span lang=NL>Het incident is gesloten.</span></p>
<p class=italFR><span lang=FR-BE>L'incident est clos.</span></p>
<p class=MsoNormal>...</p>
<p class=Titre2NL>...

So only the beginning of the meeting, and the start of a new topic before its first speech, can contain unclassified notes. Alternatively, you can classify them as <note type="comment">...</note>

There are also chair speeches that do not follow the rules above, but you have correctly identified them.
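
The rules above can be sketched with the standard-library HTML parser. This is only an illustration under assumptions, not a replacement for the BE pipeline: the first sample paragraph (its span layout and the placeholder text) is hypothetical, the speech-number pattern `{topic}.{speech in topic}` is taken as two digit groups separated by a dot, and treating every `Titre...` class as a topic heading is a guess based on the `Titre2NL` example.

```python
import re
from html.parser import HTMLParser

SPEECH_CLASSES = {"italFR", "italNL", "NormalNL", "NormalFR"}
# "{topic}.{speech in topic}" followed by the speaker name, e.g. "01.01 Nabil ..."
SPEECH_START = re.compile(r"^\d+\.\d+\s+\S")


class ParagraphCollector(HTMLParser):
    """Collect [class, text] for every <p> element in the page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.paragraphs.append([dict(attrs).get("class") or "", ""])
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1][1] += data


def classify(css_class, text):
    """Label a paragraph as topic heading, speech start, or note."""
    text = " ".join(text.split())
    if css_class.startswith("Titre"):
        return "topic"
    if css_class in SPEECH_CLASSES and SPEECH_START.match(text):
        return "speech_start"
    return "note"


sample = (
    '<p class=NormalNL><a name="TN01"></a><span>01.01</span><span> Nabil Boukili</span>'
    "<span> (PVDA-PTB):</span> Tekst</p>\n"
    "<p class=italNL><a name=T016></a><span lang=NL>Het incident is gesloten.</span></p>\n"
    "<p class=italFR><span lang=FR-BE>L'incident est clos.</span></p>\n"
    "<p class=Titre2NL>Titel</p>"
)
collector = ParagraphCollector()
collector.feed(sample)
collector.close()
labels = [classify(css_class, text) for css_class, text in collector.paragraphs]
print(labels)
```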

notes do not contain `xml:lang`

xml:lang in <note>
this is available in source HTML

strange xml directives

remove directives inside xml document <? ?>

if you want to use <note type="editorial"> then please remove <? ?> which is strange:
https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-06-03-definitief-55-plenair-ip107x.xml#L401-L403

                <note type="editorial">
                    <?uncertain_content_classification Could be possibly be speaker text?>
                    L'incident est clos.
                </note>

But I prefer not to use it, at least in the case above, that can be encoded better:

<note type="comment" xml:lang="fr">L'incident est clos.</note>

or:

<note type="narrative" xml:lang="fr">L'incident est clos.</note>

JessedeDoes · 2023-01-09T18:06:46Z

Fixing the URL (strange effect of automatic script reformatting in intellij) will be easy
The xml:lang was present on the  elements, so echoing it on the notes is not a problem
The idea with the processing instruction was to mark these cases as a todo for further processing.
Yes, surely the classification of content can be improved. Currently, we do not have any developer with time to work on this refinement; we would prefer to postpone this to a later stage when we will revisit the whole pipeline in order to minimize the amount of manual supervision, so it can run continuously on new available data instead of the current bursty approach

TomazErjavec · 2023-02-02T18:42:09Z

@JessedeDoes, in 77e8d95 I've added parla.meeting.committee to the general taxonomy. I'm not absolutely sure that the category belongs where I put it, but it might be good enough for now. So, could you copy the new category into your general ParlaMint-taxonomy-parla.legislature taxonomy and remove your additional taxonomy, please?

ParlaMint/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml

Lines 225 to 233 in 77e8d95

    
             <category xml:id="parla.meeting.committee">
               <catDesc xml:lang="sl">
                 <term>Seja delovnega telesa</term>
               </catDesc>
               <catDesc xml:lang="en">
                 <term>Committee meeting</term>
               </catDesc>
             </category>
           </category>

matyaskopp assigned JessedeDoes Dec 1, 2022

matyaskopp linked a pull request Dec 19, 2022 that will close this issue

Data BE #476

Merged

matyaskopp mentioned this issue Jan 9, 2023

Data BE #577

Closed

matyaskopp mentioned this issue Jan 15, 2023

NO text submission #513

Closed

matyaskopp mentioned this issue Jan 25, 2023

Empty speech.house #584

Closed

matyaskopp linked a pull request Feb 1, 2023 that will close this issue

Data BE #577

Closed

TomazErjavec added this to the Future milestone Mar 2, 2024
