This is the Slovenian novel collection for the ELTeC, the European Literary Text Collection, produced by the COST Action Distant Reading for European Literary History (CA16204, https://distant-reading.net).
General information about ELTeC releases is available at https://github.com/COST-ELTeC/ELTeC.
- v2.0.0, April 2021: 100 novels encoded at level 1 and level 2. DOI: https://doi.org/10.5281/zenodo.4662600.
- v0.7.1, November 2020: 100 novels encoded at level 1. DOI: https://doi.org/10.5281/zenodo.4271648.
- v0.7.0, October 2019: 100 novels encoded at level 1, plus test files for level 2. The corpus composition criteria are not fully met, because there are not enough Slovene novels to satisfy all the sampling criteria.
- Collection editor(s): Tomaž Erjavec, Miran Hladnik, Marko Juvan, Katja Mihurko Poniž
- Sources: Wikivir, either directly or via the IMP Digital Library; eZISS for one novel (Izidor Cankar: S poti)
All texts included in this collection are in the public domain. The textual markup is provided under a Creative Commons Attribution 4.0 International licence (CC BY 4.0, https://creativecommons.org/licenses/by/4.0/).
The Orig/ directory contains (messy!) scripts that download the digital sources of the novels, add metadata to them, and convert them to ELTeC level-1 encoding. It also includes various utility programs (XSLT, Perl) and the tool chain that annotates the novels into level-2 encoding and converts them into a vertical file for noSketch Engine.
The annotation tools are not included but are all open source.
Tokenisation and sentence segmentation were performed with the ReLDI tokeniser, while UD morphosyntactic tagging and lemmatisation were performed with CLASSLA-StanfordNLP trained for Slovene. Named entities were annotated with Janes-NER, which uses the set of NE labels detailed in the Annotation guidelines for Slovenian named entities Janes-NER V1.1.
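
For readers unfamiliar with the vertical format mentioned above, the following Python fragment is a minimal sketch of the general idea: one token per line with tab-separated attributes, wrapped in structure tags. It is not the conversion used in Orig/ (which relies on XSLT and Perl). The function name `to_vertical`, the column order (word form, lemma, UD part-of-speech tag), the `doc` and `s` structure names, and the sample identifier and sentence are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple
from xml.sax.saxutils import quoteattr

# Hypothetical token layout: (word form, lemma, UD part-of-speech tag).
Token = Tuple[str, str, str]


def to_vertical(doc_id: str, sentences: Iterable[List[Token]]) -> str:
    """Serialise annotated sentences as a vertical-format fragment.

    Each token goes on its own line with tab-separated attributes;
    document and sentence boundaries are written as structure tags.
    """
    lines = [f"<doc id={quoteattr(doc_id)}>"]  # quoteattr adds the quotes
    for sentence in sentences:
        lines.append("<s>")
        for form, lemma, upos in sentence:
            lines.append("\t".join((form, lemma, upos)))
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up sample sentence and identifier, not taken from the corpus.
    sample = [[
        ("Bil", "biti", "AUX"),
        ("je", "biti", "AUX"),
        ("večer", "večer", "NOUN"),
        (".", ".", "PUNCT"),
    ]]
    print(to_vertical("SLV00000", sample), end="")
```

The attribute columns and structures that the actual corpus exposes are defined by the scripts in Orig/, so this sketch should be adjusted to match them before use.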