Generating initial TEI from txt and tsv? #341

Open
nljubesi opened this issue Sep 26, 2022 · 2 comments
Labels
enhancement New feature or request
Milestone

Comments

@nljubesi
Collaborator

Hi, I wondered whether there is already code for generating a preliminary TEI when txt and tsv files are available in the proper format.

From what I understand, the intended generation direction is the opposite (TEI -> (txt, tsv)), but it seems that quite a lot of the TEI could be generated from (txt, tsv), with placeholders left to be filled in manually.

Such a script might be useful to many people, I guess. Mentioning @5roop as he might have fun preparing such a script, if this is a valid idea. I could not find this discussion in either open or closed issues, but I might have missed something. Sorry if this is the case.

@TomazErjavec
Collaborator

No, there is no such script, I think partly because most people started from HTML or XML, not plain text. Still, it might be a good thing to have, although it may be more difficult than it seems at first glance:

  • I don't think persons and organisations can be easily modelled as TSV
  • The metadata about a corpus component could be stored in TSV
  • The utterances (which is what you probably meant) could be modelled: utterance metadata = tsv, utterance text = txt
  • A problem with the utterances is the transcriber comments; you would need to introduce some special symbols for them.
  • And, I guess you wouldn't then use any "extra" markup, such as page breaks.

There might be other details to consider, once somebody gets to grips with the data. This is now off the top of my head, pending a volunteer to be found...
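
To make the utterance part of the list above more concrete, here is a minimal Python sketch of what such a converter could look like. It is not an existing ParlaMint script. The TSV column names (`utterance_id`, `speaker_id`), the `[[...]]` marker for transcriber comments, and the choice to emit `<seg>` and `<note>` children inside `<u>` are all assumptions made for illustration; the TEI namespace, the header, and person/organisation lists are left out and would still need to be filled in by hand.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Hypothetical convention: transcriber comments are written as [[comment]] in the txt.
NOTE_RE = re.compile(r"\[\[(.*?)\]\]")


def build_utterance(row, text):
    """Build one <u> element from a TSV metadata row and the utterance text."""
    u = ET.Element("u", {XML_ID: row["utterance_id"], "who": "#" + row["speaker_id"]})
    seg_no = 0
    for paragraph in text.split("\n"):
        # Splitting on a pattern with one capturing group alternates
        # plain text (even positions) and comment text (odd positions).
        for i, chunk in enumerate(NOTE_RE.split(paragraph)):
            chunk = chunk.strip()
            if not chunk:
                continue
            if i % 2:
                ET.SubElement(u, "note").text = chunk
            else:
                seg_no += 1
                seg_id = f"{row['utterance_id']}.seg{seg_no}"
                ET.SubElement(u, "seg", {XML_ID: seg_id}).text = chunk
    return u


def build_body(tsv_text, texts):
    """Build a <div> of utterances; texts maps utterance_id to its plain text."""
    div = ET.Element("div", {"type": "debateSection"})
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        div.append(build_utterance(row, texts[row["utterance_id"]]))
    return div
```

Calling `ET.tostring(build_body(tsv_text, texts), encoding="unicode")` would give a fragment to paste into a hand-written TEI skeleton. Extra markup such as page breaks is deliberately not handled, in line with the last point in the list.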

@coltekin
Collaborator

Just in case it is helpful: I have a script to convert CoNLL-U files to ParlaMint v1 at https://github.com/coltekin/ParlaMint-TR/blob/main/tomint.py. It assumes that quite a lot of information is provided as specific comments in the CoNLL-U files, plus a TSV file with speaker information. The version for v2 is in progress; I plan to push it to this repo once the output passes validation (hopefully in a few days).
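
For readers unfamiliar with the general idea, a rough Python sketch of reading `# key = value` comments from CoNLL-U and joining them with a speaker TSV might look like the following. This is not taken from tomint.py and does not describe the comments it actually expects; the `speaker` comment key and the `speaker_id` and `name` columns are hypothetical.

```python
import csv
import io


def read_conllu_comments(conllu_text):
    """Yield one dict of '# key = value' comments per sentence block."""
    meta, has_tokens = {}, False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends a sentence block
            if meta or has_tokens:
                yield meta
            meta, has_tokens = {}, False
        elif line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:  # ignore comments that are not key = value pairs
                meta[key.strip()] = value.strip()
        else:
            has_tokens = True
    if meta or has_tokens:  # last block without a trailing blank line
        yield meta


def load_speakers(tsv_text):
    """Index speaker rows by the (hypothetical) speaker_id column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["speaker_id"]: row for row in reader}
```

Each yielded dict can then be looked up in the speaker index, for example `load_speakers(tsv_text)[meta["speaker"]]`, before the TEI is written out.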

@TomazErjavec TomazErjavec added the enhancement New feature or request label Mar 3, 2024
@TomazErjavec TomazErjavec added this to the Future milestone Mar 3, 2024