ELTeC-slv

This is the Slovenian novel collection for the ELTeC, the European Literary Text Collection, produced by the COST Action Distant Reading for European Literary History (CA16204, https://distant-reading.net).

Release notes

General information about ELTeC releases is available at https://github.com/COST-ELTeC/ELTeC.

  • v2.0.0, April 2020: 100 novels encoded at level 1 and level 2. DOI: https://doi.org/10.5281/zenodo.4662600.
  • v0.7.1, November 2020: 100 novels encoded at level 1. DOI: https://doi.org/10.5281/zenodo.4271648.
  • v0.7.0, October 2019: 100 novels encoded at level 1, plus test files for level 2. The corpus composition criteria are not fully met, because there are not enough Slovene novels to satisfy all the sampling criteria.

Contributors

Licence

All texts included in this collection are in the public domain. The textual markup is provided under a Creative Commons Attribution 4.0 International licence (CC BY, https://creativecommons.org/licenses/by/4.0/).

Notes

The Orig/ directory contains (messy!) scripts to download the digital sources of the novels, add metadata to them, and convert them to ELTeC level-1 encoding. Also included are various utility programs (XSLT, Perl) and the processing chain that annotates the texts into level-2 encoding and converts them into a vertical file for NoSketch Engine.

The annotation tools are not included but are all open source.

Tokenisation and sentence segmentation were performed with the ReLDI tokeniser, while UD morphosyntactic tagging and lemmatisation were performed with CLASSLA-StanfordNLP trained for Slovene. Named entities were annotated with Janes-NER, which uses the set of NE labels detailed in the Annotation guidelines for Slovenian named entities Janes-NER V1.1.
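
As a rough illustration of the vertical format mentioned in the notes above, the following Python sketch turns a small annotated fragment into one-token-per-line output with tab-separated attributes. It is not the chain used in Orig/ (which relies on XSLT and Perl). The sample fragment, the use of `w` and `pc` elements with `lemma` and `pos` attributes, the choice of output columns, and the function name `to_vertical` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# TEI namespace in ElementTree's "{uri}" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical level-2-style fragment. Element and attribute names are
# assumptions for illustration, not copied from the corpus files.
SAMPLE = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>
      <s>
        <w lemma="biti" pos="AUX">Bil</w>
        <w lemma="biti" pos="AUX">je</w>
        <w lemma="lep" pos="ADJ">lep</w>
        <w lemma="dan" pos="NOUN">dan</w>
        <pc pos="PUNCT">.</pc>
      </s>
    </p>
  </body>
</text>"""


def to_vertical(xml_text):
    """Return vertical text: <s> tags on their own lines, one token per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for sent in root.iter(TEI_NS + "s"):
        lines.append("<s>")
        for tok in sent:
            # Only word and punctuation tokens are exported.
            if tok.tag not in (TEI_NS + "w", TEI_NS + "pc"):
                continue
            form = (tok.text or "").strip()
            lemma = tok.get("lemma", form)  # fall back to the form itself
            pos = tok.get("pos", "_")       # "_" marks a missing value
            lines.append("\t".join((form, lemma, pos)))
        lines.append("</s>")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_vertical(SAMPLE), end="")
```

Each sentence becomes a block delimited by `<s>` and `</s>`, with the word form, lemma, and part-of-speech tag in tab-separated columns. A real export would likely carry more attributes and structures (for example paragraphs and named entities).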