Segments without sentences in annotated version #555

RubenvanHeusden · 2022-12-21T16:23:07Z

During validation of the final NL corpus, there were several warnings about segments without sentences:

WARN: skipping segment without sentences ParlaMint-NL_2014-04-16-tweedekamer-5.seg6

After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed, in which case we put a gap element in the segment, like below:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="editorial">
        <desc>Sentence could not be parsed: text_of_unparsable_sentence</desc>
    </gap>
</seg>

@TomazErjavec already suggested using reason="processingError" for this, so the element would become:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="processingError">
        <desc>text_of_unparsable_sentence</desc>
    </gap>
</seg>

He also suggested omitting the segment / utterance if the error results in empty segments.
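To make the proposal concrete, here is a minimal sketch of the two steps in Python with `xml.etree.ElementTree`. It assumes un-namespaced elements as in the snippets above (the real corpus files are in the TEI namespace, so the tag names would need the namespace prefix), and the function names and the sample utterance are hypothetical, not part of any ParlaMint tooling. It only drops segments, not whole utterances.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
PREFIX = "Sentence could not be parsed: "  # wording currently used in the NL <desc>


def relabel_gaps(root):
    """Switch parse-failure gaps from reason="editorial" to "processingError"."""
    changed = 0
    for gap in root.iter("gap"):
        desc = gap.find("desc")
        text = desc.text if desc is not None and desc.text else ""
        if gap.get("reason") == "editorial" and text.startswith(PREFIX):
            gap.set("reason", "processingError")
            desc.text = text[len(PREFIX):]  # keep only the unparsable text
            changed += 1
    return changed


def drop_segments_without_sentences(root):
    """Remove every <seg> that has no <s> child; return the removed IDs."""
    removed = []
    for u in list(root.iter("u")):
        for seg in list(u.findall("seg")):
            if seg.find("s") is None:
                u.remove(seg)
                removed.append(seg.get(XML_ID))
    return removed


# Hypothetical utterance: one parsed segment, one that failed to parse.
sample = ET.fromstring(
    '<u xml:id="u1">'
    '<seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>'
    '<seg xml:id="u1.seg2"><gap reason="editorial">'
    '<desc>Sentence could not be parsed: some text</desc></gap></seg>'
    '</u>'
)
relabelled = relabel_gaps(sample)
removed_ids = drop_segments_without_sentences(sample)
```

The list of removed IDs is returned on purpose, since those are exactly the segments that would no longer have a counterpart in the annotated version.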

I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.
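The mismatch could be checked with something like the sketch below, which compares the segment IDs of a plain text file with those of its annotated counterpart. The two inline documents are made-up stand-ins for a real file pair, the helper name is hypothetical, and elements are again assumed to be un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def segment_ids(root):
    """Collect the xml:id of every <seg> in a document."""
    return {seg.get(XML_ID) for seg in root.iter("seg")}


# Stand-in for the plain text version: both segments are present.
plain = ET.fromstring(
    '<u xml:id="u1">'
    '<seg xml:id="u1.seg1">Dank</seg>'
    '<seg xml:id="u1.seg2">some text</seg>'
    '</u>'
)
# Stand-in for the annotated version after the empty segment was omitted.
ana = ET.fromstring(
    '<u xml:id="u1">'
    '<seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>'
    '</u>'
)

# Segments that exist in the plain text version but not in the annotated one.
missing = sorted(segment_ids(plain) - segment_ids(ana))
```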

@TomazErjavec @matyaskopp , what do you think about this? Is this ok, or is there maybe some way to still keep the reference to the annotated segment? I could of course remove them from both versions, but as far as I could see, the sentences in the plain text versions were valid sentences, so it would be a shame to leave them out.

TomazErjavec · 2022-12-21T16:38:11Z

> I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.

I don't see why this would be problematic, but who knows, it might be, e.g. if the MT takes the unannotated version of the corpus to translate and we then also want to link the MTed version to the .ana version.

> is there maybe some way to still keep the reference to the annotated segment?

Under the assumption that you would remove only segments (and not whole utterances), you could give the gap the ID of the deleted segment. Let me know if you decide to do this: gap currently cannot have @xml:id, so I would need to add it to the schema.
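A rough sketch of what that could look like, in Python with `xml.etree.ElementTree`. It assumes that the gap moves up into the utterance in place of the deleted segment, which is one reading of the suggestion rather than an agreed design, and that elements are un-namespaced as in the snippets above. The function name and sample are hypothetical, and the output would only validate once @xml:id is allowed on gap.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def replace_empty_segments_with_gaps(u):
    """Swap each sentence-less <seg> in <u> for its <gap>, which inherits the ID."""
    moved = []
    for index, child in enumerate(list(u)):
        if child.tag != "seg" or child.find("s") is not None:
            continue
        gap = child.find("gap")
        seg_id = child.get(XML_ID)
        if gap is None or seg_id is None:
            continue  # nothing to carry the ID, leave the segment alone
        gap.set(XML_ID, seg_id)  # the gap takes over the segment's ID
        gap.tail = child.tail
        u.remove(child)
        u.insert(index, gap)  # one-for-one swap keeps later indices valid
        moved.append(seg_id)
    return moved


# Hypothetical utterance: the second segment holds only a gap.
u = ET.fromstring(
    '<u xml:id="u1">'
    '<seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>'
    '<seg xml:id="u1.seg2"><gap reason="processingError">'
    '<desc>some text</desc></gap></seg>'
    '</u>'
)
moved_ids = replace_empty_segments_with_gaps(u)
```

With this, a reference from the plain text version to `u1.seg2` would still resolve in the annotated version, only to a gap instead of a segment.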

We could also just give up on the idea that segments need to have at least one sentence, but this is a worst-case scenario, as we catch quite a few true errors by having this constraint.
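One possible middle ground, sketched here purely as an illustration and not as how the current validation works, would be to keep the constraint but report gap-only segments separately from other segments without sentences. The function name, the sample, and the un-namespaced elements are all assumptions.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def classify_segments(root):
    """Split sentence-less segments into gap-only ones and all others."""
    gap_only, suspicious = [], []
    for seg in root.iter("seg"):
        if seg.find("s") is not None:
            continue  # segment has at least one sentence
        children = list(seg)
        if children and all(child.tag == "gap" for child in children):
            gap_only.append(seg.get(XML_ID))  # known processing failure
        else:
            suspicious.append(seg.get(XML_ID))  # possibly a true error
    return gap_only, suspicious


# Hypothetical utterance: a normal, a gap-only, and a completely empty segment.
sample = ET.fromstring(
    '<u xml:id="u1">'
    '<seg xml:id="u1.seg1"><s xml:id="u1.seg1.s1"><w>Dank</w></s></seg>'
    '<seg xml:id="u1.seg2"><gap reason="processingError">'
    '<desc>some text</desc></gap></seg>'
    '<seg xml:id="u1.seg3"/>'
    '</u>'
)
gap_only, suspicious = classify_segments(sample)
for seg_id in suspicious:
    print(f"WARN: skipping segment without sentences {seg_id}")
```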

matyaskopp · 2022-12-27T19:01:55Z

> After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed...

I don't understand what failed.

tokenization
morphology
syntax
named entities

Did int-tagger produce this error, so that the rest of the tools (udify and flair-ner) were not used?

RubenvanHeusden · 2022-12-28T10:47:11Z

I don't have access to the complete logs of the NLP pipeline, as this was run by the Belgian team. As far as I can tell, this happens with very long sentences or with characters that cause the tokenisation to fail, but I am not completely sure. I will have to look into this a bit more.

TomazErjavec added this to the Future milestone Mar 2, 2024
