Segments without sentences in annotated version #555

Open
RubenvanHeusden opened this issue Dec 21, 2022 · 3 comments
@RubenvanHeusden
Collaborator

RubenvanHeusden commented Dec 21, 2022

During validation of the final NL corpus, there were several warnings about segments without sentences:

WARN: skipping segment without sentences ParlaMint-NL_2014-04-16-tweedekamer-5.seg6

After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed, in which case we put a gap element in the segment, like below:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="editorial">
        <desc>Sentence could not be parsed: text_of_unparsable_sentence</desc>
    </gap>
</seg>
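
To list the affected segments before deciding how to handle them, something like the following could be used. It is only a minimal sketch: it assumes the files use the standard TEI namespace and that sentences are `s` elements somewhere inside `seg`. The sample document and its `demo.*` IDs are made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def segments_without_sentences(root):
    """Return the xml:id of every <seg> that contains no <s> element."""
    empty = []
    for seg in root.iter(f"{{{TEI_NS}}}seg"):
        # ".//" also matches sentences nested deeper than direct children.
        if seg.find(f".//{{{TEI_NS}}}s") is None:
            empty.append(seg.get(XML_ID))
    return empty


# Hypothetical sample; real ParlaMint files are much larger.
SAMPLE = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.u1">
  <seg xml:id="demo.seg1"><s xml:id="demo.seg1.s1"><w>Dank</w></s></seg>
  <seg xml:id="demo.seg2">
    <gap reason="editorial">
      <desc>Sentence could not be parsed: some text</desc>
    </gap>
  </seg>
</u>"""

print(segments_without_sentences(ET.fromstring(SAMPLE)))
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`.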

@TomazErjavec already suggested using reason="processingError" for this, so the element would become:

<seg xml:id="ParlaMint-NL_2014-04-16-tweedekamer-5.seg6">
    <gap reason="processingError">
        <desc>text_of_unparsable_sentence</desc>
    </gap>
</seg>

He also suggested omitting the segment or utterance if the error leaves it empty.
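
The relabelling itself could be done along these lines. This is a rough sketch, not the actual NL conversion code: it assumes the TEI namespace and that every affected `desc` starts with the literal prefix shown in the first example. The function name and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
PREFIX = "Sentence could not be parsed: "


def relabel_parse_failures(root):
    """Switch parse-failure gaps from reason="editorial" to "processingError".

    Only gaps whose <desc> starts with PREFIX are touched; the prefix is
    dropped so that <desc> holds just the original sentence text.
    Returns the number of gaps changed.
    """
    changed = 0
    for gap in root.iter(f"{{{TEI_NS}}}gap"):
        desc = gap.find(f"{{{TEI_NS}}}desc")
        if gap.get("reason") != "editorial" or desc is None:
            continue
        text = desc.text or ""
        if text.startswith(PREFIX):
            gap.set("reason", "processingError")
            desc.text = text[len(PREFIX):]
            changed += 1
    return changed


# Hypothetical sample segment.
SAMPLE = """<seg xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.seg2">
  <gap reason="editorial">
    <desc>Sentence could not be parsed: some text</desc>
  </gap>
</seg>"""

root = ET.fromstring(SAMPLE)
print(relabel_parse_failures(root))
```

Other editorial gaps are left alone, since only the ones carrying the parse-failure prefix are known to be processing errors.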

I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.

@TomazErjavec @matyaskopp , what do you think about this? Is this ok, or is there maybe some way to still keep the reference to the annotated segment? I could of course remove them from both versions, but as far as I could see, the sentences in the plain text versions were valid sentences, so it would be a shame to leave them out.
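
To see how many IDs would be affected, the segment IDs of the two versions can be compared directly. A minimal sketch, assuming both versions use the TEI namespace and put `xml:id` on `seg`; the two sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def segment_ids(root):
    """Collect the xml:id values of all <seg> elements under root."""
    return {seg.get(XML_ID) for seg in root.iter(f"{{{TEI_NS}}}seg")}


# Hypothetical plain-text version: both segments are present.
PLAIN = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.u1">
  <seg xml:id="demo.seg1">Dank u wel.</seg>
  <seg xml:id="demo.seg2">A sentence the parser rejected.</seg>
</u>"""

# Hypothetical annotated version: the unparsable segment was omitted.
ANA = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.u1">
  <seg xml:id="demo.seg1"><s xml:id="demo.seg1.s1"><w>Dank</w></s></seg>
</u>"""

# IDs that exist in the plain version but have no counterpart in .ana.
missing = sorted(segment_ids(ET.fromstring(PLAIN)) - segment_ids(ET.fromstring(ANA)))
print(missing)
```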

@TomazErjavec
Collaborator

I like this solution, however this would mean that for certain sentences in the plain text version, there is no segment with the corresponding ID in the annotated version, which might be problematic.

I don't see why this would be problematic, but it might be, e.g. if machine translation (MT) takes the unannotated version of the corpus as input and we also want to link the translated version to the .ana version.

is there maybe some way to still keep the reference to the annotated segment?

Assuming you remove only segments (and not whole utterances), you could give the gap the ID of the deleted segment. Let me know if you decide to do this: gap currently cannot have @xml:id, so I would need to add it to the schema.
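
A sketch of what that transformation might look like, under several assumptions: the schema has been extended to allow @xml:id on gap, gap is allowed in the position the segment occupied, and a sentence-less segment holds a single gap (any further gaps would be dropped here). The function name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SEG, S, GAP = (f"{{{TEI_NS}}}{name}" for name in ("seg", "s", "gap"))


def replace_empty_segments_with_gaps(root):
    """Replace each sentence-less <seg> by its first <gap>, keeping the ID.

    Returns the list of segment IDs that were moved onto a gap.
    """
    moved = []
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag != SEG or child.find(f".//{S}") is not None:
                continue
            gap = child.find(GAP)
            if gap is None:
                continue  # empty for some other reason; leave untouched
            gap.set(XML_ID, child.get(XML_ID))
            gap.tail = child.tail      # preserve surrounding whitespace
            parent[index] = gap        # swap in place, so order is kept
            moved.append(child.get(XML_ID))
    return moved


# Hypothetical sample utterance.
SAMPLE = """<u xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.u1">
  <seg xml:id="demo.seg1"><s xml:id="demo.seg1.s1"><w>Dank</w></s></seg>
  <seg xml:id="demo.seg2"><gap reason="processingError"><desc>some text</desc></gap></seg>
</u>"""

root = ET.fromstring(SAMPLE)
print(replace_empty_segments_with_gaps(root))
```

This keeps every ID from the plain text version resolvable in the annotated version, at the cost of the ID pointing to a gap rather than a seg.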

We could also just give up on the idea that segments need to have at least one sentence, but this would be a last resort, as we catch quite a few true errors by having this constraint.

@matyaskopp
Collaborator

After some investigation, I found out that for NL, this is because certain sentences could not be linguistically parsed...

I don't understand what failed.

  • tokenization
  • morphology
  • syntax
  • named entities

Did int-tagger produce this error, and were the remaining tools (udify and flair-ner) then not used?

@RubenvanHeusden
Collaborator Author

I don't have access to the complete logs of the NLP pipeline, as this was run by the Belgian team. As far as I can tell, this happens with very long sentences or with characters that cause the tokenisation to fail, but I am not completely sure. I will have to look into this a bit more.
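
One way to check both guesses without the pipeline logs would be to profile the text stored in each gap's `desc`. The sketch below is only a diagnostic aid under that assumption: it reports length and any control, format, private-use or unassigned characters, and does not reproduce what the tagger actually does. The function name is hypothetical.

```python
import unicodedata

# Unicode general categories that often trip up tokenisers:
# control, format, private use, unassigned.
SUSPICIOUS = {"Cc", "Cf", "Co", "Cn"}


def describe_failure(text):
    """Summarise an unparsable sentence: size and unusual characters."""
    odd = sorted(
        {ch for ch in text
         if unicodedata.category(ch) in SUSPICIOUS and not ch.isspace()}
    )
    return {
        "chars": len(text),
        "words": len(text.split()),
        "suspicious": [f"U+{ord(ch):04X}" for ch in odd],
    }


# Example with a zero-width space hidden inside the first word.
print(describe_failure("a\u200bb c"))
```

If most failures show a very high word count or a non-empty `suspicious` list, that would support the explanation above; if not, the cause is probably elsewhere in the pipeline.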

@TomazErjavec added this to the Future milestone Mar 2, 2024