Convert text files with Stanza (NLP) to AntConc-Format for different languages
This repository contains a collection of scripts for the automatic processing and annotation of Church Slavonic, Latin, and Ancient Greek corpus materials using Stanza. These scripts were developed as part of the training on Corpus Linguistics in Slavic Studies (Online-Workshop 1, 2) and enable the automatic analysis of texts with modern NLP methods.
The scripts cover the entire text processing pipeline:
- TXT → CoNLL-U: Conversion of plain text files into CoNLL-U format with lemmatization, POS tagging, and dependency parsing.
- CoNLL-U → POS: Extraction of POS-tagged text from CoNLL-U files for a simplified view.
- CoNLL-U → SQLite: Storage of CoNLL-U data in an SQLite database for structured analysis.
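As a rough illustration of the CoNLL-U → POS step, the sketch below reads CoNLL-U text and emits one line per sentence with tokens written as `form_UPOS`. The function name `conllu_to_pos_text`, the underscore separator, and the choice of the UPOS column are assumptions made for this example; the notebooks may use a different output layout. The sample sentence is hand-written for illustration and is not actual Stanza output.

```python
def conllu_to_pos_text(conllu: str, sep: str = "_") -> str:
    """Turn CoNLL-U text into one line per sentence of FORM<sep>UPOS tokens."""
    sentences = []
    current = []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment (sent_id, text, ...)
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        current.append(f"{form}{sep}{upos}")
    if current:
        sentences.append(" ".join(current))
    return "\n".join(sentences)


# Hand-written sample (assumed annotation, for illustration only).
sample = (
    "# text = In principio erat Verbum\n"
    "1\tIn\tin\tADP\t_\t_\t2\tcase\t_\t_\n"
    "2\tprincipio\tprincipium\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\terat\tsum\tAUX\t_\t_\t2\tcop\t_\t_\n"
    "4\tVerbum\tuerbum\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
)
print(conllu_to_pos_text(sample))  # In_ADP principio_NOUN erat_AUX Verbum_NOUN
```

A word-plus-tag layout of this kind is a common input for concordancers; the separator is kept as a parameter so it can be matched to the tag delimiter configured in the tool.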
File | Description |
---|---|
stanza-txt-2-conllu-grc.ipynb | Converts Ancient Greek texts into CoNLL-U format |
stanza-txt-2-conllu-lat.ipynb | Converts Latin texts into CoNLL-U format |
stanza-txt-2-conllu-chu.ipynb | Converts Church Slavonic texts into CoNLL-U format |
stanza-conllu-2-pos-grc.ipynb | Extracts POS annotations from Ancient Greek CoNLL-U |
stanza-conllu-2-pos-lat.ipynb | Extracts POS annotations from Latin CoNLL-U |
stanza-conllu-2-pos-chu.ipynb | Extracts POS annotations from Church Slavonic CoNLL-U |
stanza-conllu-2-sqlite-grc.ipynb | Stores Ancient Greek CoNLL-U data in SQLite |
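To give an idea of what storing CoNLL-U data in SQLite can look like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `tokens`, its column layout, and the function `load_conllu_into_sqlite` are assumptions for this example and need not match the schema used in the notebook.

```python
import sqlite3


def load_conllu_into_sqlite(conllu: str, conn: sqlite3.Connection) -> int:
    """Insert every 10-column CoNLL-U line into a `tokens` table; return the row count."""
    # Assumed schema: a running sentence number plus the ten CoNLL-U fields.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "sent_no INTEGER, token_id TEXT, form TEXT, lemma TEXT, upos TEXT, "
        "xpos TEXT, feats TEXT, head TEXT, deprel TEXT, deps TEXT, misc TEXT)"
    )
    rows = []
    sent_no = 1
    seen_token = False
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line closes the sentence, if it contained any tokens.
            if seen_token:
                sent_no += 1
                seen_token = False
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        cols = line.split("\t")
        if len(cols) == 10:
            rows.append((sent_no, *cols))
            seen_token = True
    conn.executemany(
        "INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


# Two one-token sentences, hand-written for illustration only.
sample = (
    "1\tλόγος\tλόγος\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tθεὸς\tθεός\tNOUN\t_\t_\t0\troot\t_\t_\n"
)
conn = sqlite3.connect(":memory:")
n = load_conllu_into_sqlite(sample, conn)
counts = conn.execute("SELECT upos, COUNT(*) FROM tokens GROUP BY upos").fetchall()
print(n, counts)  # 2 [('NOUN', 2)]
```

Once the tokens are in a table, frequency counts per lemma or POS tag become simple `GROUP BY` queries. Replacing `":memory:"` with a file path would persist the database on disk.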
- Python 3.x
- Stanza (`pip install stanza`)
- SQLite3 for database handling
pip install stanza
The required language models can be downloaded using Stanza:
import stanza
stanza.download('grc') # Ancient Greek
stanza.download('la') # Latin
stanza.download('cu') # Church Slavonic
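Once a model is downloaded, the TXT → CoNLL-U step could look roughly like the sketch below. It assumes a Stanza version that provides `CoNLL.write_doc2conll` in `stanza.utils.conll`; the function names `txt_to_conllu` and `conllu_path_for`, the output naming rule, and the file name in the example call are hypothetical and not taken from the notebooks.

```python
from pathlib import Path


def conllu_path_for(txt_path: str) -> Path:
    """Hypothetical naming rule: corpus/text.txt -> corpus/text.conllu."""
    return Path(txt_path).with_suffix(".conllu")


def txt_to_conllu(txt_path: str, lang: str) -> Path:
    """Annotate one plain-text file with Stanza and write the result as CoNLL-U."""
    # Imported here so the path helper above works without Stanza installed.
    import stanza
    from stanza.utils.conll import CoNLL

    # Without a `processors` argument, Stanza loads the default processors
    # available for the language.
    nlp = stanza.Pipeline(lang)
    text = Path(txt_path).read_text(encoding="utf-8")
    doc = nlp(text)

    out_path = conllu_path_for(txt_path)
    CoNLL.write_doc2conll(doc, str(out_path))
    return out_path


# Example call (hypothetical file name; requires the 'la' model):
# txt_to_conllu("corpus/sample.txt", "la")
```

Building the pipeline is the slow part, so when converting many files it is usually worth creating `nlp` once and reusing it across files rather than rebuilding it per call as this sketch does.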
The Jupyter notebooks can be opened and executed using Jupyter Notebook or Jupyter Lab:
jupyter notebook
This project was created as part of a training session on Corpus Linguistics in Slavic Studies. If you have any questions or suggestions, feel free to open an issue or get in touch.