Stanza-Conllu-2Corpus

Convert plain-text files in several languages to an AntConc-compatible format using Stanza (NLP)

Stanza-based Processing of Church Slavonic and Classical Texts

This repository contains a collection of scripts for automatically processing and annotating Church Slavonic, Latin, and Ancient Greek corpus materials with Stanza. The scripts were developed for the training on Corpus Linguistics in Slavic Studies (Online-Workshop 1, 2) and enable the automatic analysis of texts with modern NLP methods.

Overview

The scripts cover the entire text processing pipeline:

TXT → CoNLL-U: Conversion of plain text files into CoNLL-U format with lemmatization, POS tagging, and dependency parsing.
CoNLL-U → POS: Extraction of POS-tagged text from CoNLL-U files for a simplified view.
CoNLL-U → SQLite: Storage of CoNLL-U data in an SQLite database for structured analysis.
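The CoNLL-U → POS step above can be illustrated without Stanza, since CoNLL-U is a plain tab-separated format with ten columns per token. The following is a minimal sketch, not the notebooks' actual code: the function names, the `word_TAG` output layout (a common convention for tagged text in AntConc), and the hand-written `SAMPLE` annotation are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple

# Hand-written illustration only; not output produced by Stanza.
SAMPLE = (
    "# text = Gallia est omnis\n"
    "1\tGallia\tGallia\tPROPN\t_\t_\t3\tnsubj\t_\t_\n"
    "2\test\tsum\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tomnis\tomnis\tDET\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def read_sentences(conllu_text: str) -> Iterator[List[Tuple[str, str, str]]]:
    """Yield one list of (form, lemma, upos) tuples per sentence."""
    sentence: List[Tuple[str, str, str]] = []
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append((form, lemma, upos))
    if sentence:
        yield sentence


def to_pos_text(conllu_text: str, sep: str = "_") -> str:
    """Render each sentence as one line of form<sep>UPOS tokens."""
    return "\n".join(
        " ".join(f"{form}{sep}{upos}" for form, _lemma, upos in sent)
        for sent in read_sentences(conllu_text)
    )


print(to_pos_text(SAMPLE))  # Gallia_PROPN est_AUX omnis_DET
```

Swapping `upos` for the lemma column in `to_pos_text` would give a lemmatized view instead; which columns the notebooks actually export is not specified here.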

Included Scripts

File	Description
`stanza-txt-2-conllu-grc.ipynb`	Converts Ancient Greek texts into CoNLL-U format
`stanza-txt-2-conllu-lat.ipynb`	Converts Latin texts into CoNLL-U format
`stanza-txt-2-conllu-chu.ipynb`	Converts Church Slavonic texts into CoNLL-U format
`stanza-conllu-2-pos-grc.ipynb`	Extracts POS annotations from Ancient Greek CoNLL-U
`stanza-conllu-2-pos-lat.ipynb`	Extracts POS annotations from Latin CoNLL-U
`stanza-conllu-2-pos-chu.ipynb`	Extracts POS annotations from Church Slavonic CoNLL-U
`stanza-conllu-2-sqlite-grc.ipynb`	Stores Ancient Greek CoNLL-U data in SQLite
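
The SQLite step in the last row can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table name `tokens`, its column layout, and the helper names are hypothetical and may differ from what `stanza-conllu-2-sqlite-grc.ipynb` actually creates.

```python
import sqlite3


def create_table(conn: sqlite3.Connection) -> None:
    """Create a flat token table: one row per CoNLL-U token line."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tokens (
               doc TEXT, sent_idx INTEGER,
               token_id TEXT, form TEXT, lemma TEXT, upos TEXT, xpos TEXT,
               feats TEXT, head TEXT, deprel TEXT, deps TEXT, misc TEXT)"""
    )


def insert_conllu(conn: sqlite3.Connection, doc_name: str, conllu_text: str) -> int:
    """Insert all token lines of one CoNLL-U document; return the row count."""
    rows = []
    sent_idx = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if in_sentence:
                sent_idx += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        in_sentence = True
        rows.append((doc_name, sent_idx, *cols))
    conn.executemany(
        "INSERT INTO tokens VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows
    )
    conn.commit()
    return len(rows)
```

Once the data is loaded, a query such as `SELECT lemma, COUNT(*) FROM tokens WHERE upos = 'VERB' GROUP BY lemma` gives verb lemma frequencies, which is the kind of structured analysis a database makes convenient.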

Requirements

Python 3.x
Stanza (pip install stanza)
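SQLite3 for database handling (the `sqlite3` module ships with Python's standard library)
Jupyter Notebook or JupyterLab to run the notebooks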

Usage

1. Install dependencies

pip install stanza

2. Download Stanza models

The required language models can be downloaded using Stanza:

import stanza
stanza.download('grc')  # Ancient Greek
stanza.download('la')   # Latin
stanza.download('cu')   # Church Slavonic
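
With the models downloaded, the TXT → CoNLL-U step might look roughly like the sketch below. It assumes a folder of UTF-8 `.txt` files and uses Stanza's default pipeline for the language together with `CoNLL.write_doc2conll` from `stanza.utils.conll`. The helper names `output_path` and `convert_folder` are hypothetical, and the notebooks may configure the pipeline differently.

```python
from pathlib import Path


def output_path(txt_path: Path, out_dir: Path) -> Path:
    """Map e.g. some/dir/a.txt to <out_dir>/a.conllu."""
    return out_dir / (txt_path.stem + ".conllu")


def convert_folder(lang: str, in_dir: str, out_dir: str) -> int:
    """Annotate every .txt file in in_dir and write one .conllu file each."""
    # Imported here so the path helper above works without Stanza installed.
    import stanza
    from stanza.utils.conll import CoNLL

    nlp = stanza.Pipeline(lang)  # default processors for this language
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for txt in sorted(Path(in_dir).glob("*.txt")):
        doc = nlp(txt.read_text(encoding="utf-8"))
        CoNLL.write_doc2conll(doc, str(output_path(txt, out)))
        count += 1
    return count
```

A call such as `convert_folder("grc", "test-grc/txt", "test-grc-out")` would process the Ancient Greek test folder; that the notebooks use exactly this input and output layout is an assumption.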

3. Run the scripts

The notebooks can be opened and executed in Jupyter Notebook or JupyterLab. Start the server from the repository folder:

jupyter notebook

Contact

This project was created as part of a training session on Corpus Linguistics in Slavic Studies. If you have any questions or suggestions, feel free to open an issue or get in touch.

