Generating initial TEI from txt and tsv? #341

nljubesi · 2022-09-26T06:27:27Z

Hi, I was wondering whether code already exists for generating a preliminary TEI from txt and tsv files that are already in the proper format.

From what I understand, generation is meant to go in the opposite direction (TEI -> (txt, tsv)), but it seems that quite a lot of the TEI could be generated from (txt, tsv), together with placeholders to be filled in manually.

Such a script might be useful to many people, I guess. Mentioning @5roop, as he might have fun preparing such a script, if this is a valid idea. I could not find this discussion in either the open or the closed issues, but I might have missed something. Sorry if that is the case.

TomazErjavec · 2022-09-26T09:24:40Z

No, there is no such script, I think partly because most people started from HTML or XML, not plain text. Still, it might be a good thing to have, though it might be more difficult than it seems at first glance:

I don't think persons and organisations can be easily modelled as TSV.
The metadata about a corpus component could be stored in TSV.
The utterances (which is what you probably meant) could be modelled as: utterance metadata in TSV, utterance text in txt.
A problem with the utterances is the transcriber comments; you would need to introduce some special symbols for them in the text.
And, I guess, you wouldn't then use any "extra" markup, such as page breaks.

There might be other details to consider once somebody gets to grips with the data. This is just off the top of my head, pending a volunteer being found...
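As a rough illustration of the utterance split described above (metadata in TSV, text in txt), here is a minimal Python sketch. Everything specific in it is an assumption, not an existing ParlaMint tool: the TSV column names `id` and `who`, the one-line-per-utterance `id<TAB>text` layout of the txt file, and the `[[...]]` marker for transcriber comments are all made up for the example. The header is only filled with TODO placeholders, and the `who` pointers would still need person entries written by hand.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Hypothetical convention: transcriber comments appear as [[...]] in the txt.
NOTE_RE = re.compile(r"\[\[(.*?)\]\]")


def sub(parent, tag, text=None):
    """Append a TEI-namespaced child element, optionally with text."""
    el = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
    el.text = text
    return el


def fill_seg(seg, text):
    """Copy text into <seg>, turning [[...]] markers into <note> children."""
    pos, last = 0, None
    for m in NOTE_RE.finditer(text):
        chunk = text[pos:m.start()]
        if last is None:
            seg.text = chunk      # text before the first note
        else:
            last.tail = chunk     # text between two notes
        last = sub(seg, "note", m.group(1))
        pos = m.end()
    if last is None:
        seg.text = text[pos:]
    else:
        last.tail = text[pos:]


def build_tei(meta_tsv, text_tsv):
    """Build a skeleton TEI tree from utterance metadata and utterance text.

    meta_tsv: file-like TSV with a header row containing `id` and `who`
              (assumed column names).
    text_tsv: file-like, one `id<TAB>text` line per utterance (assumed layout).
    """
    texts = dict(
        line.rstrip("\n").split("\t", 1) for line in text_tsv if "\t" in line
    )
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    file_desc = sub(sub(tei, "teiHeader"), "fileDesc")
    # Placeholders to be filled in manually.
    sub(sub(file_desc, "titleStmt"), "title", "TODO: title")
    sub(sub(file_desc, "publicationStmt"), "p", "TODO: publication details")
    sub(sub(file_desc, "sourceDesc"), "p", "TODO: source description")
    div = sub(sub(sub(tei, "text"), "body"), "div")
    reader = csv.DictReader(meta_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        u = sub(div, "u")
        u.set(XML_ID, row["id"])
        u.set("who", "#" + row["who"])
        fill_seg(sub(u, "seg"), texts.get(row["id"], ""))
    return ET.ElementTree(tei)


# Tiny in-memory demo; real input would come from open(...) on the two files.
meta = io.StringIO("id\twho\nu1\tSpeakerA\nu2\tSpeakerB\n")
text = io.StringIO(
    "u1\tGood morning. [[applause]] Let us begin.\nu2\tThank you.\n"
)
print(ET.tostring(build_tei(meta, text).getroot(), encoding="unicode"))
```

The output of something like this would not pass ParlaMint validation as it stands; it only shows how much of the structure falls out mechanically and which parts are left as placeholders.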

coltekin · 2022-09-26T10:01:02Z

Just in case it is helpful: I have a script to convert CoNLL-U files to ParlaMint v1 at https://github.com/coltekin/ParlaMint-TR/blob/main/tomint.py. It assumes that a fair amount of information is provided as specific comments in the CoNLL-U files, plus a TSV file with speaker information. The version for v2 is in progress; I plan to push it to this repo once the output passes validation (hopefully in a few days).
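For readers unfamiliar with the idea of carrying metadata in CoNLL-U comments, here is a small Python sketch of reading `# key = value` comment lines per sentence. It is not taken from tomint.py, and which comment keys that script expects is not stated in this thread: `sent_id` and `text` are standard CoNLL-U comments, while the `speaker` key below is purely hypothetical.

```python
import io


def read_conllu(stream):
    """Yield (comments, tokens) for each sentence in a CoNLL-U stream.

    comments: dict built from `# key = value` lines preceding the tokens.
    tokens:   list of token rows, each a list of tab-separated columns.
    """
    comments, tokens = {}, []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = {}, []
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:  # ignore comments that are not key = value pairs
                comments[key.strip()] = value.strip()
        else:
            tokens.append(line.split("\t"))
    if tokens:  # file did not end with a blank line
        yield comments, tokens


sample = io.StringIO(
    "# sent_id = s1\n"
    "# text = Thank you.\n"
    "# speaker = SpeakerA\n"  # hypothetical key, for illustration only
    "1\tThank\tthank\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tyou\tyou\tPRON\t_\t_\t1\tobj\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
for comments, tokens in read_conllu(sample):
    print(comments.get("speaker"), [tok[1] for tok in tokens])
```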

TomazErjavec added the "enhancement" (New feature or request) label on Mar 3, 2024

TomazErjavec added this to the "Future" milestone on Mar 3, 2024
