EMBEDDIA/stacked-ner

Run the code

BERT models need to be downloaded before running the code (with the exception of CamemBERT).

Training (the options are listed one per line so that each can be commented; pass them on a single command line):


CUDA_VISIBLE_DEVICES=1,2,3 python main.py
--directory TEMP_MODEL # path where the model is saved; predictions on test/dev are saved here automatically at the end of training
--pre_trained_model PRETRAINED_MODEL_NAME # e.g. bert-base-cased
--train_dataset train.tsv
--test_dataset test.tsv
--dev_dataset valid.tsv
--batch_size 4
--do_train
--no_cpu 5
--language french # for CamemBERT; english for other models
--model stacked # or bert
--num_layers 2 # 2 Transformer layers
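
Because the options above are spread over several commented lines, it can help to assemble the command programmatically. The following is a minimal sketch, not part of the repository: the helper name `build_train_command` and its keyword defaults are hypothetical, and only the flags shown above are used.

```python
import shlex


def build_train_command(model_dir, pretrained, train, dev, test,
                        language="english", model="stacked",
                        num_layers=2, batch_size=4, no_cpu=5):
    """Return the training invocation of main.py as an argument list.

    Hypothetical helper: it only mirrors the flags listed in this README.
    """
    return [
        "python", "main.py",
        "--directory", model_dir,
        "--pre_trained_model", pretrained,
        "--train_dataset", train,
        "--test_dataset", test,
        "--dev_dataset", dev,
        "--batch_size", str(batch_size),
        "--do_train",
        "--no_cpu", str(no_cpu),
        "--language", language,
        "--model", model,
        "--num_layers", str(num_layers),
    ]


cmd = build_train_command("TEMP_MODEL", "bert-base-cased",
                          "train.tsv", "valid.tsv", "test.tsv")

# One shell-safe line that can be pasted into a terminal.
print(shlex.join(cmd))

# To launch it from Python with the GPU selection from the example
# (not executed here):
#   import os, subprocess
#   env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1,2,3"}
#   subprocess.run(cmd, env=env, check=True)
```

Keeping the arguments as a list avoids quoting problems when paths contain spaces; `shlex.join` requires Python 3.8 or later.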

Predicting (as for training, pass the options on a single command line):


python main.py
--directory TEMP_MODEL # same value as in training
--pre_trained_model PRETRAINED_MODEL_NAME # same value as in training
--train_dataset train.tsv # same value as in training
--test_dataset test.tsv # same value as in training
--dev_dataset valid.tsv # same value as in training
--dataset_dir DIR_DATA_TEST # directory with the .tsv files to be predicted
--output_dir DIR_DATA_TEST_PREDICTIONS # directory where predictions will be saved
--batch_size 4
--do_eval
--saved_model TEMP_MODEL/best/best_ # best model after training
--no_cpu 5
--language french # for CamemBERT; english for other models; same value as in training
--model stacked # or bert; same value as in training
--num_layers 2 # 2 Transformer layers; same value as in training
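
Since most prediction options must repeat the values used for training, one way to avoid mismatches is to derive the prediction command from the training settings. This is only a sketch under assumptions: `SHARED_KEYS` and `build_predict_command` are hypothetical names, and the `--saved_model` value is built from the model directory simply because the example above uses `TEMP_MODEL/best/best_`.

```python
# Options that the README marks as identical between training and prediction.
SHARED_KEYS = [
    "directory", "pre_trained_model", "train_dataset", "test_dataset",
    "dev_dataset", "language", "model", "num_layers",
]


def build_predict_command(train_settings, dataset_dir, output_dir,
                          batch_size=4, no_cpu=5):
    """Build the --do_eval invocation of main.py from the training settings."""
    missing = [key for key in SHARED_KEYS if key not in train_settings]
    if missing:
        raise KeyError(f"training settings lack: {', '.join(missing)}")

    args = ["python", "main.py"]
    for key in SHARED_KEYS:
        args += [f"--{key}", str(train_settings[key])]
    args += [
        "--dataset_dir", dataset_dir,
        "--output_dir", output_dir,
        "--batch_size", str(batch_size),
        "--do_eval",
        # Assumption: follows the TEMP_MODEL/best/best_ pattern of the example.
        "--saved_model", f"{train_settings['directory']}/best/best_",
        "--no_cpu", str(no_cpu),
    ]
    return args


train_settings = {
    "directory": "TEMP_MODEL",
    "pre_trained_model": "PRETRAINED_MODEL_NAME",
    "train_dataset": "train.tsv",
    "test_dataset": "test.tsv",
    "dev_dataset": "valid.tsv",
    "language": "french",
    "model": "stacked",
    "num_layers": 2,
}
predict_cmd = build_predict_command(
    train_settings, "DIR_DATA_TEST", "DIR_DATA_TEST_PREDICTIONS")
print(" ".join(predict_cmd))
```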


Dataset Annotation
TOKEN	NE-COARSE-LIT	NE-COARSE-METO	NE-FINE-LIT	NE-FINE-METO	NE-FINE-COMP	NE-NESTED	NEL-LIT	NEL-METO	MISC
# language = fr
# newspaper = GDL
# date = 1878-02-22
# document_id = GDL-1878-02-22-a-i0014
# segment_iiif_link = _
LAUSANNE	B-loc	O	B-loc.adm.town	O	O	O	Q807	_	EndOfLine

On	O	O	O	O	O	O	_	_	_
nous	O	O	O	O	O	O	_	_	_
prie	O	O	O	O	O	O	_	_	_
de	O	O	O	O	O	O	_	_	_
faire	O	O	O	O	O	O	_	_	_
connaître	O	O	O	O	O	O	_	_	_
le	O	O	O	O	O	O	_	_	_
résultat	O	O	O	O	O	O	_	_	EndOfLine
Sécuniaire	O	O	O	O	O	O	_	_	_
des	O	O	O	O	O	O	_	_	_
quatre	O	O	O	O	O	O	_	_	_
conférences	O	O	O	O	O	O	_	_	_
sur	O	O	O	O	O	O	_	_	_
l'	O	O	O	O	O	O	_	_	NoSpaceAfter
Orient	B-loc	O	B-loc.adm.sup	O	O	O	Q205653	_	EndOfLine

M	B-pers	O	B-pers.ind	O	B-comp.title	O	Q123894	_	NoSpaceAfter
.	I-pers	O	I-pers.ind	O	I-comp.title	O	Q123894	_	_
le	I-pers	O	I-pers.ind	O	O	O	Q123894	_	_
professeur	I-pers	O	I-pers.ind	O	B-comp.function	O	Q123894	_	_
Gilliéron	I-pers	O	I-pers.ind	O	B-comp.name	O	Q123894	_	NoSpaceAfter
.	O	O	O	O	O	O	_	_	EndOfLine
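
The sample above can be read with a few lines of Python. The sketch below is illustrative only and is not the repository's loader: `read_annotated_tsv` and `coarse_entities` are hypothetical names, and the code assumes tab-separated columns, a header on the first data line, and metadata lines that start with `# `.

```python
def read_annotated_tsv(lines):
    """Yield one dict per token row, skipping blank and '# ' metadata lines."""
    header = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("# "):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment line holds the column names
            continue
        yield dict(zip(header, fields))


def coarse_entities(rows, column="NE-COARSE-LIT"):
    """Group B-/I- tags of one column into (type, tokens) spans."""
    entities, current = [], None
    for row in rows:
        tag = row[column]
        if tag.startswith("B-"):
            current = (tag[2:], [row["TOKEN"]])
            entities.append(current)
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(row["TOKEN"])
        else:
            current = None  # an O tag (or a stray I- tag) closes the span
    return entities


SAMPLE = [
    "TOKEN\tNE-COARSE-LIT\tNE-COARSE-METO\tNE-FINE-LIT\tNE-FINE-METO"
    "\tNE-FINE-COMP\tNE-NESTED\tNEL-LIT\tNEL-METO\tMISC",
    "# language = fr",
    "LAUSANNE\tB-loc\tO\tB-loc.adm.town\tO\tO\tO\tQ807\t_\tEndOfLine",
    "",
    "M\tB-pers\tO\tB-pers.ind\tO\tB-comp.title\tO\tQ123894\t_\tNoSpaceAfter",
    ".\tI-pers\tO\tI-pers.ind\tO\tI-comp.title\tO\tQ123894\t_\t_",
    "le\tI-pers\tO\tI-pers.ind\tO\tO\tO\tQ123894\t_\t_",
    "professeur\tI-pers\tO\tI-pers.ind\tO\tB-comp.function\tO\tQ123894\t_\t_",
    "Gilliéron\tI-pers\tO\tI-pers.ind\tO\tB-comp.name\tO\tQ123894\t_\tNoSpaceAfter",
    ".\tO\tO\tO\tO\tO\tO\t_\t_\tEndOfLine",
]

rows = list(read_annotated_tsv(SAMPLE))
entities = coarse_entities(rows)
print(entities)
# [('loc', ['LAUSANNE']), ('pers', ['M', '.', 'le', 'professeur', 'Gilliéron'])]
```

The same grouping works for the other tag columns, for example `coarse_entities(rows, column="NE-FINE-LIT")`.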

Requirements

pip install -r requirements.txt

How to cite:

@inproceedings{boros2020robust,
  title={Robust named entity recognition and linking on historical multilingual documents},
  author={Boros, Emanuela and Pontes, Elvys Linhares and Cabrera-Diego, Luis Adri{\'a}n and Hamdi, Ahmed and Moreno, Jos{\'e} and Sid{\`e}re, Nicolas and Doucet, Antoine},
  booktitle={Conference and Labs of the Evaluation Forum (CLEF 2020)},
  volume={2696},
  number={Paper 171},
  pages={1--17},
  year={2020},
  organization={CEUR-WS Working Notes}
}
@inproceedings{borocs2020alleviating,
  title={Alleviating digitization errors in named entity recognition for historical documents},
  author={Boro{\c{s}}, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
  booktitle={Proceedings of the 24th Conference on Computational Natural Language Learning},
  pages={431--441},
  year={2020}
}