WG2-Sample

Repository for the use of WG2 in preparing their white paper on "Annotating European Novels for Distant Reading".

It should contain a total of 100 samples from each of at least 7 different ELTeC repositories, made up of 5 random passages of 400 whitespace-delimited tokens taken from each of 20 novels. Headings should be excluded, but poetry should not, and each sample should be a well-formed XML fragment.

Samples were selected using the selector.xsl stylesheet, as follows:

generate a sequence of five random numbers in the range 1 to n, where n is the number of paragraphs in the body of the text (using www.random.org)
for each such number r, create a new <sample> containing the rth and following paragraphs, such that the total word count is at least 400
if the end of a chapter or other division occurs before the required number of words has been copied, continue into the next division (but ignore any text not contained in a paragraph)
if the end of the text occurs before the required number of words has been copied, the sample generated is empty
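
The selection rules above can be sketched in Python. This is an illustration only, not the selector.xsl stylesheet itself: it assumes the paragraphs of the body have already been extracted into a flat list of strings in document order (so division boundaries are crossed automatically and text outside paragraphs is never seen), and it swaps in Python's `random` module for www.random.org. The names `select_sample` and `select_samples` are hypothetical.

```python
import random


def select_sample(paragraphs, r, min_words=400):
    """Collect paragraphs from the r-th one (1-based) until at least
    min_words whitespace-delimited tokens have been gathered.
    Returns an empty list if the text ends before that point."""
    sample = []
    count = 0
    for para in paragraphs[r - 1:]:
        sample.append(para)
        count += len(para.split())  # whitespace-delimited tokens
        if count >= min_words:
            return sample
    return []  # end of text reached first: the sample is empty


def select_samples(paragraphs, k=5, min_words=400, rng=None):
    """Draw k random start positions in 1..n and build one sample for each.
    Assumption: Python's random module stands in for www.random.org."""
    rng = rng or random.Random()
    n = len(paragraphs)
    starts = [rng.randint(1, n) for _ in range(k)]
    return [(r, select_sample(paragraphs, r, min_words)) for r in starts]
```

Because a start position near the end of a novel can yield an empty sample, a caller would probably want to check for empty results and redraw if a full set of five is needed.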

All tagging except for the <p> delimiting each paragraph is removed. Each <p> uses its @n attribute to supply a locator made by concatenating the text identifier (value of TEI/@xml:id) and the paragraph sequence number.
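
As a rough illustration of this flattening step, the sketch below uses Python's standard `xml.etree.ElementTree`. Several points are assumptions rather than facts about the stylesheet: that the source paragraphs are in the TEI namespace, that whitespace inside a paragraph is normalized, and that the text identifier and sequence number are joined with a separator `sep` (the exact locator format is not stated here). The function name `flatten_paragraphs` and the identifier `ENG0001` are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def flatten_paragraphs(tei_xml, sep=""):
    """Return plain <p> elements for every paragraph in the body, with all
    inner tagging removed and @n set to text identifier + sequence number."""
    root = ET.fromstring(tei_xml)
    text_id = root.get(XML_ID)  # value of TEI/@xml:id
    body = root.find(f".//{{{TEI_NS}}}body")
    flat = []
    for seq, p in enumerate(body.iter(f"{{{TEI_NS}}}p"), start=1):
        out = ET.Element("p", {"n": f"{text_id}{sep}{seq}"})
        # itertext() drops child tags but keeps their text content;
        # collapsing whitespace is an assumption of this sketch.
        out.text = " ".join("".join(p.itertext()).split())
        flat.append(out)
    return flat


doc = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ENG0001">'
    "<text><body><div><head>Chapter 1</head>"
    "<p>Hello <hi>big</hi> world.</p><p>Second.</p>"
    "</div></body></text></TEI>"
)
for p in flatten_paragraphs(doc, sep="_"):
    print(ET.tostring(p, encoding="unicode"))
```

Headings are skipped here simply because only `<p>` elements are visited; paragraphs nested inside other paragraphs would be counted separately, which a real implementation might need to handle.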

Each set of five samples is stored in a file named [text-identifier]_sample.xml. All the files for each language are stored in a directory named for the language.
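
The naming convention can be expressed as a small path helper. This is a sketch under assumptions: `sample_path` is a hypothetical name, and the language code `eng` and identifier `ENG0001` are example values only.

```python
from pathlib import Path


def sample_path(repo_root, language, text_id):
    """Build [language]/[text-identifier]_sample.xml under repo_root."""
    return Path(repo_root) / language / f"{text_id}_sample.xml"


# Example values; not taken from the repository contents.
print(sample_path(".", "eng", "ENG0001").as_posix())
```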

LB 2018-11-19

