Latin text dataset for machine learning and procedural text generation

mathisve/LatinTextDataset

Latin Text Dataset

A dataset of more than 28.7 million characters of Latin text for machine learning, language generation, and analysis.

About

This is a small snippet of what the dataset looks like:

Cum venisset accitus praedicto die, advocato omni quod aderat commilitio, tribunali ad altiorem
suggestum erecto, quod aquilae circumdederunt et signa, Augustus insistens eumque manu retinens
dextera, haec sermone placido peroravit: Adsistimus apud illos, optimi rei publicae defensores, 
causae communi uno paene omnium spiritu vindicandae, quam acturus tamquam apud aequos iudices.

As you can see, it is all authentic Latin written in Roman times by historical figures such as Caesar, Augustus, and many more.

There are still a few kinks I have not been able to resolve, such as the occasional title or capitalised Roman numeral. Because the dataset is so large, these should make little difference: they are diluted enough that LSTMs (or GRUs) are unlikely to pick up on them.
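If the leftover numerals do matter for your use case, one possible clean-up pass is sketched below. This is not part of main.py; the function name `strip_roman_numerals` and the regular expression are assumptions made for illustration. The pattern only targets runs of two or more capital numeral letters, so single-letter numerals (I, V, X) and abbreviations such as "C." are left alone, and titles are not handled at all.

```python
import re

# Matches a standalone run of two or more capital Roman-numeral letters,
# e.g. "XIV" or "MCMXC". Assumption: any such all-caps token is a numeral;
# this can also catch the rare real word written entirely in these letters.
ROMAN_NUMERAL = re.compile(r"\b[IVXLCDM]{2,}\b")


def strip_roman_numerals(text: str) -> str:
    """Remove capitalised Roman numerals and collapse the leftover spaces."""
    without_numerals = ROMAN_NUMERAL.sub("", text)
    # Removing a token leaves a double space behind; squeeze it back to one.
    return re.sub(r"[ \t]{2,}", " ", without_numerals).strip()
```

Calling `strip_roman_numerals("Liber XIV Cum venisset")` drops the numeral and keeps the surrounding words.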

All text originates from thelatinlibrary.com, which is, to my knowledge, in the public domain.

Getting Started

You can either use the pre-scraped and pre-processed file, latincorpus.txt, or run or modify main.py and configure it to your liking. Scraping all the text takes about 3-5 minutes on a computer with a moderately fast CPU and a wired Ethernet connection.
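For the pre-processed route, a minimal sketch of loading latincorpus.txt and preparing it for a character-level model might look like the following. The helper names `build_char_vocab` and `encode` are hypothetical, and the sketch assumes the file is UTF-8 encoded and sits in the working directory.

```python
from pathlib import Path


def build_char_vocab(text: str) -> dict[str, int]:
    """Map each distinct character to an integer index, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn text into a list of character indices."""
    return [vocab[ch] for ch in text]


if __name__ == "__main__":
    # Assumption: latincorpus.txt is in the working directory and is UTF-8.
    corpus = Path("latincorpus.txt").read_text(encoding="utf-8")
    vocab = build_char_vocab(corpus)
    print(f"{len(corpus):,} characters, {len(vocab)} distinct symbols")
    print(encode(corpus[:20], vocab))
```

The resulting index sequence can be sliced into fixed-length windows and fed to an LSTM or GRU in whichever framework you prefer.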

Prerequisites

The following libraries are required to run main.py. To install them automatically, see Installing below.

selenium==3.141.0
beautifulsoup4==4.7.1
tqdm==4.31.1

Installing

To install the Python libraries listed above, run this command:

pip3 install -r requirements.txt
