This repository contains the Finnish Corpus of Online REgisters (FinCORE) data introduced in the paper Toward Multilingual Identification of Online Registers (pdf).
Please note that this is the first, earlier version of FinCORE. The FINAL version is found at https://github.com/TurkuNLP/FinCORE_full.
FinCORE annotations are licensed under CC BY.
The software introduced in the paper is available from https://github.com/spyysalo/multiling-cnn.
The annotated texts are found in the files train.tsv
, dev.tsv
and
test.tsv
in the data/
subdirectory in a simple TSV format where each
line has the format LABEL<TAB>TEXT
.
To format the data for fastText, run
mkdir fasttext
for f in data/{train,dev,test}.tsv; do
perl -pe 's/^(.*?)\t/__label__$1 /' $f > fasttext/$(basename $f .tsv).ft
done
Download Finnish MUSE word vectors
wget https://dl.fbaipublicfiles.com/arrival/vectors/wiki.multi.fi.vec
Train fastText, predict label probabilities for test data
fasttext supervised -epoch 10 -pretrainedVectors wiki.multi.fi.vec -dim 300 \
-input fasttext/train.ft -output fasttext.model
fasttext predict-prob fasttext.model.bin fasttext/test.ft 6 > probs.txt
Evaluate
python scripts/auc.py fasttext/test.ft probs.txt