Code and checkpoints for the ACL 2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter"
arXiv link to the paper: https://arxiv.org/abs/2105.07148
- Python 3.7.0
- transformers 3.4.0
- Numpy 1.18.5
- Packaging 17.1
- scikit-learn 0.23.2
- torch 1.6.0+cu92
- tqdm 4.50.2
- multiprocess 0.70.10
- tensorflow 2.3.1
- tensorboardX 2.1
- seqeval 1.2.1
Input data uses the CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.
美 B-LOC
国 E-LOC
的 O
华 B-PER
莱 I-PER
士 E-PER

我 O
跟 O
他 O
谈 O
笑 O
风 O
生 O
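To make the tag scheme concrete, here is a minimal sketch that decodes character-level BIOES tags into labelled spans. The function name `bioes_to_spans` is hypothetical and not part of this repository; it only illustrates how the B-/I-/E-/S-/O prefixes combine.

```python
from typing import List, Tuple


def bioes_to_spans(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (text, type) spans from character-level BIOES tags."""
    spans = []
    buffer, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag == "O":
            # Outside any span: discard anything left unfinished.
            buffer, current_type = [], None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-character span.
            spans.append((ch, label))
            buffer, current_type = [], None
        elif prefix == "B":
            # Start a new span.
            buffer, current_type = [ch], label
        elif prefix in ("I", "E") and current_type == label:
            buffer.append(ch)
            if prefix == "E":
                # End tag closes the span.
                spans.append(("".join(buffer), label))
                buffer, current_type = [], None
        else:
            # Malformed sequence (for example I- without a preceding B-).
            buffer, current_type = [], None
    return spans


chars = list("美国的华莱士")
tags = ["B-LOC", "E-LOC", "O", "B-PER", "I-PER", "E-PER"]
print(bioes_to_spans(chars, tags))  # [('美国', 'LOC'), ('华莱士', 'PER')]
```

Spans that never reach an E- tag are dropped in this sketch; a stricter checker might raise an error instead.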
Chinese BERT: https://cdn.huggingface.co/bert-base-chinese-pytorch_model.bin
Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz
- Weibo NER
- OntoNotes 4 NER
- MSRA NER
- Resume NER
- CTB5 POS
- CTB6 POS
- UD1 POS
- UD2 POS
- CTB6 CWS
- MSR CWS
- PKU CWS
- berts
  - bert
    - config.json
    - vocab.txt
    - pytorch_model.bin
- dataset
  - NER
    - note4
    - msra
    - resume
  - POS
    - ctb5
    - ctb6
    - ud1
    - ud2
  - CWS
    - ctb6
    - msr
    - pku
- vocab
  - tencent_vocab.txt, the vocab of pre-trained word embedding table.
- embedding
  - word_embedding.txt
- result
  - NER
    - note4
    - msra
    - resume
  - POS
    - ctb5
    - ctb6
    - ud1
    - ud2
  - CWS
    - ctb6
    - msr
    - pku
- log
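The empty folders can be created in one go. The sketch below assumes the task folders (NER, POS, CWS) sit under both `dataset` and `result`, and that `bert` sits under `berts`; `make_layout` is a hypothetical helper, not a script shipped with this repository. The model files, datasets, and embeddings still have to be downloaded and placed by hand.

```python
import os
import tempfile

# Assumed nesting: task folders (NER/POS/CWS) under dataset/ and result/.
TASKS = {
    "NER": ["note4", "msra", "resume"],
    "POS": ["ctb5", "ctb6", "ud1", "ud2"],
    "CWS": ["ctb6", "msr", "pku"],
}
TOP_LEVEL = ["berts/bert", "vocab", "embedding", "log"]


def make_layout(root):
    """Create the directory skeleton under `root` and return the paths."""
    rel_paths = list(TOP_LEVEL)
    for parent in ("dataset", "result"):
        for task, names in TASKS.items():
            rel_paths.extend(f"{parent}/{task}/{name}" for name in names)
    created = []
    for rel in rel_paths:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(path, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstration in a throwaway directory; pass "." to use the project root.
with tempfile.TemporaryDirectory() as tmp:
    paths = make_layout(tmp)
    print(len(paths), "directories created under", tmp)
```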
1. Convert the .char.bmes files to .json files:
   python3 to_json.py
2. Run the training script:
   sh run_ner.sh
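For orientation, the conversion in step 1 might look roughly like the sketch below. This is not the repository's `to_json.py`: the function names and the output schema (one JSON object per line with `text` and `label` keys) are assumptions made for illustration, so check the real script before relying on the format.

```python
import json
from typing import Dict, Iterable, Iterator, List


def conll_to_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Group 'char label' lines into one record per sentence."""
    chars, labels = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence.
            if chars:
                yield {"text": chars, "label": labels}
                chars, labels = [], []
            continue
        parts = line.split()
        chars.append(parts[0])
        labels.append(parts[-1])
    if chars:
        # Last sentence when the file lacks a trailing blank line.
        yield {"text": chars, "label": labels}


def convert(src_path: str, dst_path: str) -> int:
    """Write one JSON object per sentence; return the sentence count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for record in conll_to_records(src):
            # ensure_ascii=False keeps Chinese characters readable.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


sample = ["美 B-LOC\n", "国 E-LOC\n", "\n", "我 O\n"]
for record in conll_to_records(sample):
    print(json.dumps(record, ensure_ascii=False))
```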
The model was trained in distributed mode, so the checkpoints cannot be loaded directly in single-GPU mode. Follow the steps below to patch the transformers source before loading the checkpoints.
1. Enter the transformers source directory:
   cd source/transformers-master
2. Open modeling_utils.py and go to around line 995.
3. Build and install the revised source code:
   python3 setup.py install
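As a possible alternative to patching the library, the checkpoint keys can be renamed before loading. The sketch below assumes the mismatch comes from the `module.` prefix that `torch.nn.parallel.DistributedDataParallel` adds to parameter names; whether that matches the change made around line 995 is not confirmed here. `strip_module_prefix` is a hypothetical helper, not part of this repository or of transformers.

```python
from collections import OrderedDict
from typing import Mapping


def strip_module_prefix(state_dict: Mapping, prefix: str = "module.") -> OrderedDict:
    """Return a copy of `state_dict` with a leading `prefix` removed from keys."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


# Plain-dict demonstration; real checkpoints map names to tensors.
demo = {"module.bert.embeddings.weight": 1, "classifier.bias": 2}
print(list(strip_module_prefix(demo)))

# With PyTorch installed, usage might look like this (path is a placeholder):
#   import torch
#   state = torch.load("path/to/pytorch_model.bin", map_location="cpu")
#   model.load_state_dict(strip_module_prefix(state))
```

Only keys that start with the prefix are changed, so the helper can also be applied to checkpoints saved without distributed wrapping.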
@misc{liu2021lexicon,
title={Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter},
author={Wei Liu and Xiyan Fu and Yue Zhang and Wenming Xiao},
year={2021},
eprint={2105.07148},
archivePrefix={arXiv},
primaryClass={cs.CL}
}