This project implements a solution to the "X" label issue (e.g., #148, #422) of the NER task in Google's BERT paper. It is based mostly on lemonhu's work and bheinzerling's suggestion.
- Chinese: MSRA, which is reported to be incomplete. A complete version can be found here.
- English: CONLL-2003
This repo was tested on Python 3.6+ and PyTorch 1.3.1. The main requirements are:
- nltk
- tqdm
- pytorch >= 1.3.1
- 🤗transformers == 2.2.2
- tensorflow == 1.11.0 (Optional)
Note: The tensorflow library is only used for the conversion of pretrained models from TensorFlow to PyTorch.
- Download and unzip the Chinese (English) NER model weights under experiments/msra(conll)/, then run the following commands to try out the pretrained NER model interactively:
      python build_dataset_tags.py --dataset=msra
      python interactive.py --dataset=msra
- Get BERT model for PyTorch
  There are two ways to get the pretrained BERT model in a PyTorch dump for your experiments:
  - [Automatically] Download the specified pretrained BERT model provided by huggingface🤗
  - [Manually] Convert the TensorFlow checkpoint to a PyTorch dump
    - Download Google's pretrained BERT models for Chinese (BERT-Base, Chinese) and English (BERT-Base, Cased). Then decompress them under pretrained_bert_models/bert-chinese-cased/ and pretrained_bert_models/bert-base-cased/ respectively. More pre-trained models are available here.
    - Run the following commands to convert the TensorFlow checkpoint to a PyTorch dump, as huggingface suggests. Here is an example of the conversion process for a pretrained BERT-Base Cased model.
          export TF_BERT_MODEL_DIR=/full/path/to/cased_L-12_H-768_A-12
          export PT_BERT_MODEL_DIR=/full/path/to/pretrained_bert_models/bert-base-cased
          transformers bert \
            $TF_BERT_MODEL_DIR/bert_model.ckpt \
            $TF_BERT_MODEL_DIR/bert_config.json \
            $PT_BERT_MODEL_DIR/pytorch_model.bin
    - Copy the BERT parameters file bert_config.json and the dictionary file vocab.txt to the directory $PT_BERT_MODEL_DIR.
          cp $TF_BERT_MODEL_DIR/bert_config.json $PT_BERT_MODEL_DIR/config.json
          cp $TF_BERT_MODEL_DIR/vocab.txt $PT_BERT_MODEL_DIR/vocab.txt
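After a manual conversion, the target directory should contain the three files named in the steps above. The following is a minimal sanity check using only the standard library. The helper name `missing_model_files` is hypothetical and not part of this repo, and the fallback directory is only an example.

```python
import os

# File names taken from the conversion and copy steps above.
EXPECTED_FILES = ("pytorch_model.bin", "config.json", "vocab.txt")


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir (hypothetical helper)."""
    return [name for name in EXPECTED_FILES
            if not os.path.isfile(os.path.join(model_dir, name))]


# Use $PT_BERT_MODEL_DIR if it is set, otherwise fall back to an example path.
model_dir = os.environ.get("PT_BERT_MODEL_DIR", "pretrained_bert_models/bert-base-cased")
missing = missing_model_files(model_dir)
print("OK" if not missing else "Missing: " + ", ".join(missing))
```

An empty result means the directory holds everything the manual steps are supposed to produce. It does not verify that the weights themselves are valid.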
- Build dataset and tags
  If you use the default parameters (the CONLL-2003 dataset is the default), just run
      python build_dataset_tags.py
  Or specify the dataset (e.g., MSRA) and other parameters on the command line
      python build_dataset_tags.py --dataset=msra
  It will extract the sentences and tags from train_bio, test_bio and val_bio (if val_bio is not provided, it will randomly sample 5% of the data from train_bio to create it). It then splits them into train/val/test, saves them in a convenient format for our model, and creates a file tags.txt containing the collection of tags.
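The fallback described above (sampling 5% of the training data when no validation file is provided) and the collection of tags can be sketched as follows. This is a minimal illustration, not the code in build_dataset_tags.py. The function names, the fixed seed, and the in-memory data layout are all assumptions.

```python
import random


def split_train_val(sentences, tags, val_ratio=0.05, seed=0):
    """Randomly move val_ratio of the (sentence, tags) pairs into a validation set."""
    if len(sentences) != len(tags):
        raise ValueError("sentences and tags must have the same length")
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_val = int(len(indices) * val_ratio)
    val_idx = set(indices[:n_val])
    train = [(sentences[i], tags[i]) for i in range(len(sentences)) if i not in val_idx]
    val = [(sentences[i], tags[i]) for i in sorted(val_idx)]
    return train, val


def collect_tags(tag_sequences):
    """Return the sorted list of distinct tags, e.g. one entry per line of tags.txt."""
    return sorted({tag for seq in tag_sequences for tag in seq})


# Toy data: 100 one-word sentences, one of which carries an entity tag.
sentences = [["w%d" % i] for i in range(100)]
tags = [["O"] for _ in range(99)] + [["B-PER"]]
train, val = split_train_val(sentences, tags)
print(len(train), len(val))  # 95 5
print(collect_tags(tags))    # ['B-PER', 'O']
```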
- Set experimental hyperparameters
  We created directories with the same names as the datasets under the experiments directory. Each contains a file params.json which sets the hyperparameters for the experiment. It looks like
      {
          "full_finetuning": true,
          "max_len": 180,
          "learning_rate": 5e-5,
          "weight_decay": 0.01,
          "clip_grad": 5
      }
  For a different dataset, you will need to create a new directory under experiments with its own params.json.
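A hyperparameter file like the one above can be read with the standard json module. The sketch below shows one way such a file might be consumed. It is not the repo's own loader, and the class name `Params` is hypothetical. Note that `json.load` rejects a trailing comma after the last entry, so the file must be strict JSON.

```python
import json


class Params:
    """Expose the keys of a params.json-style file as attributes (hypothetical helper)."""

    def __init__(self, json_path):
        with open(json_path, encoding="utf-8") as f:
            self.__dict__.update(json.load(f))


def load_params(json_path):
    """Load hyperparameters from json_path and return them as a Params object."""
    return Params(json_path)
```

For example, `load_params("experiments/msra/params.json").learning_rate` would return the learning rate configured for the MSRA experiment, assuming that file exists.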
- Train and evaluate the model
  If you use the default parameters (the CONLL-2003 dataset is the default), just run
      python train.py
  Or specify the dataset (e.g., MSRA) and other parameters on the command line
      python train.py --dataset=msra
  A suitable pretrained BERT model is chosen automatically according to the language of the specified dataset. The script instantiates a model and trains it on the training set following the hyperparameters specified in params.json. It also evaluates some metrics on the development set.
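How a `--dataset` flag could select a language-specific pretrained model can be sketched as follows. The dataset names and model directories are taken from this README. The mapping itself, the default value `conll`, and the function names are assumptions rather than the actual logic in train.py.

```python
import argparse

# Directories from the manual conversion step; the dataset-to-model mapping is an assumption.
BERT_DIR_BY_DATASET = {
    "conll": "pretrained_bert_models/bert-base-cased/",    # English
    "msra": "pretrained_bert_models/bert-chinese-cased/",  # Chinese
}


def build_parser():
    """Build a parser with a single --dataset option (assumed default: conll)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", default="conll", choices=sorted(BERT_DIR_BY_DATASET))
    return parser


def select_bert_dir(argv=None):
    """Return (dataset, model_dir) for the given command-line arguments."""
    args = build_parser().parse_args(argv)
    return args.dataset, BERT_DIR_BY_DATASET[args.dataset]


print(select_bert_dir([]))                  # ('conll', 'pretrained_bert_models/bert-base-cased/')
print(select_bert_dir(["--dataset=msra"]))  # ('msra', 'pretrained_bert_models/bert-chinese-cased/')
```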
- Evaluation on the test set
  Once you have run your experiments and selected the best model and hyperparameters based on performance on the development set, you can evaluate the model on the test set.
  If you use the default parameters (the CONLL-2003 dataset is the default), just run
      python evaluate.py
  Or specify the dataset (e.g., MSRA) and other parameters on the command line
      python evaluate.py --dataset=msra
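This README does not state which metrics evaluate.py reports. A common choice for NER is entity-level precision, recall, and F1 over BIO tags, where a predicted entity counts only if its span and type both match. The following is a sketch under that assumption. The function names are hypothetical and the repo's own metric code may differ.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in one BIO sequence; end is exclusive."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing entity
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, ent_type))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Compute entity-level (precision, recall, F1) over paired tag sequences."""
    n_correct = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        n_correct += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


true = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
p, r, f = entity_prf(true, pred)
print(p, r, round(f, 3))  # 1.0 0.5 0.667
```

In this toy example the model finds the person entity but misses the location, so precision is perfect while recall is one half.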