This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the input text, as well as scripts to train entity recognition models. Named entity recognition is based on spaCy and Python.
An article describing the main features has been published on e-editiones.org.
The project
- serves the Named Entity Recognition API which is accessed by TEI Publisher to enrich TEI documents with auto-detected entities
- provides scripts to train new models based on training data extracted from existing TEI documents in TEI Publisher
Note: as this feature is still under development, you need the (not yet released) version 8 or the master
branch of TEI Publisher, which integrates the NER API into the web-annotation editor.
- Install dependencies by running `pip3 install -r requirements.txt`
- Download one or more trained spaCy pipelines, e.g. for German: `python3 -m spacy download de_core_news_sm`
We're using a spaCy project setup to orchestrate the different services and workflow steps. The setup is configured in `project.yml`, where you can change various variables. It also defines various commands and workflows, which can be executed using `python3 -m spacy project run [name]`. Commands are only re-run if their inputs have changed.
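The "only re-run if inputs have changed" behaviour can be pictured with a small checksum comparison. This is a conceptual sketch, not spaCy's implementation: spaCy does its own bookkeeping, and the lock file used here, along with the helper names `needs_rerun` and `record_run`, are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def checksums(inputs: List[Path]) -> Dict[str, str]:
    """Map each input path to the SHA-256 digest of its current contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in inputs}


def needs_rerun(inputs: List[Path], lock_file: Path) -> bool:
    """Return True if no run was recorded yet or any input differs from the record."""
    if not lock_file.exists():
        return True
    recorded = json.loads(lock_file.read_text())
    return recorded != checksums(inputs)


def record_run(inputs: List[Path], lock_file: Path) -> None:
    """Store the current input checksums after a successful run."""
    lock_file.write_text(json.dumps(checksums(inputs)))
```

A command runner built on this would call `needs_rerun` before executing a step and `record_run` afterwards, so that an unchanged step is skipped on the next invocation.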
To start the Named Entity Recognition (NER) Service, run the following command:
python3 -m spacy project run serve
By default the service will listen on port 8001, which corresponds to the port TEI Publisher has configured. If you now open a document in TEI Publisher's annotation editor (or reload the browser window if you had one open), you should see that an additional button is enabled at the bottom right of the toolbar. This indicates that TEI Publisher was able to communicate with the NER service.
Whenever a user runs automatic entity detection
- TEI Publisher extracts the plain text of a TEI document, remembering the original position of each text fragment within the TEI XML
- The plain text is sent to the `/entities` endpoint of the named entity recognition API, which returns a JSON array of the entities found
- TEI Publisher re-maps each received entity back to its position in the original TEI XML and creates an annotation, which is inserted into the web annotation editor
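The extract-and-re-map steps above can be sketched as an offset lookup. This is a minimal illustration, not TEI Publisher's actual code: the fragment identifiers, the `(node_id, text)` representation and the helper names `flatten` and `locate` are all assumptions made for the example.

```python
from bisect import bisect_right
from typing import List, Tuple


def flatten(fragments: List[Tuple[str, str]]) -> Tuple[str, List[int], List[str]]:
    """Concatenate (node_id, text) fragments and record where each one starts."""
    starts, ids, parts = [], [], []
    pos = 0
    for node_id, text in fragments:
        starts.append(pos)
        ids.append(node_id)
        parts.append(text)
        pos += len(text)
    return "".join(parts), starts, ids


def locate(offset: int, starts: List[int], ids: List[str]) -> Tuple[str, int]:
    """Map an offset in the plain text to (node_id, offset inside that fragment)."""
    i = bisect_right(starts, offset) - 1
    return ids[i], offset - starts[i]


# Hypothetical fragments taken from a TEI paragraph with inline markup.
fragments = [("p1", "Letter from "), ("hi1", "Karl Barth"), ("p1-tail", " to Basel.")]
plain, starts, ids = flatten(fragments)

# An entity reported at a character offset in the plain text
# is traced back to the fragment it came from.
print(locate(plain.index("Basel"), starts, ids))
```

Keeping the fragment start offsets sorted makes the lookup a binary search, which stays cheap even for long documents with many text nodes.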
You can view the API documentation here: http://localhost:8001/docs
The default models provided by spaCy perform well on simple modern language texts, but may not produce adequate results on your particular edition. You may thus want to train a model based on a sample collection of texts you compiled. This requires that you have TEI documents which have already been semantically enriched with entity markup, e.g. by annotating them manually using TEI Publisher's annotation editor.
You can train a new model:
- using TEI Publisher's web interface
- via the command line
The first step is the same for both approaches: store the compiled sample documents into a collection below TEI Publisher's `data` collection (or reuse the existing `annotate` collection). The easiest way is to use eXide for creating a sub-collection (below `/db/apps/tei-publisher/data`) and uploading the compiled documents.
To train via the web interface, log in and open the `train-ner.html` page in TEI Publisher directly, or navigate to it via the `admin` menu.
To train a model via the command line:
- make sure that the variable `training_collection` in `project.yml` points to the sample collection you chose
- run the `all` workflow to start the training:
python3 -m spacy project run all
This will:
- contact TEI Publisher's API endpoint to extract sample data from the documents in the collection. The sample data is essentially a list of text blocks and the position of entities occurring in those blocks.
- convert the received sample data into the binary format required by spaCy
- start the actual training
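Before conversion it can help to sanity-check the extracted sample data, since spaCy's NER training expects entity spans that lie inside the text and do not overlap. The sketch below assumes a hypothetical sample format (a list of dicts with `text` and `entities` as `[start, end, label]` triples); the format actually returned by TEI Publisher may differ, and `find_problems` is an illustrative helper, not part of this project.

```python
from typing import Dict, List


def find_problems(samples: List[Dict]) -> List[str]:
    """Report entity spans that fall outside their text block or overlap another span."""
    problems = []
    for i, sample in enumerate(samples):
        text = sample["text"]
        previous_end = 0
        for start, end, label in sorted(sample["entities"]):
            if not (0 <= start < end <= len(text)):
                problems.append(f"block {i}: {label} span ({start}, {end}) is outside the text")
                continue
            if start < previous_end:
                problems.append(f"block {i}: {label} span ({start}, {end}) overlaps an earlier entity")
            previous_end = max(previous_end, end)
    return problems


# Hypothetical sample data: the second block contains an overlapping span.
samples = [
    {"text": "Karl Barth wrote from Basel.", "entities": [[0, 10, "PER"], [22, 27, "LOC"]]},
    {"text": "Karl Barth", "entities": [[0, 10, "PER"], [5, 10, "PER"]]},
]
for problem in find_problems(samples):
    print(problem)
```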
The result will be a new model stored in the `models` subdirectory. Now restart the NER service and you should see that the new model is picked up by TEI Publisher and offered for selection in the models dropdown within the annotation editor.
The following commands are defined by the project. They can be executed using `spacy project run [name]`. Commands are only re-run if their inputs have changed.
| Command | Description |
| --- | --- |
| `clean` | Remove auxiliary files and directories |
| `cleanall` | Remove auxiliary files and directories |
| `download` | Download a spaCy model with pretrained vectors |
| `convert` | Convert the data to spaCy's binary format |
| `convert.debug` | Convert the data to spaCy's binary format, also dumping the input JSON data into the training data output directory |
| `check` | Check the created training data sets |
| `create-config` | Create a new config with an NER pipeline component |
| `create-config-update` | Create a config which updates the NER component of an existing pipeline, but keeps all other components |
| `train` | Train the NER model |
| `train-with-vectors` | Train the NER model with vectors |
| `evaluate` | Evaluate the model and export metrics |
| `package` | Package the trained model as a pip package |
| `visualize-model` | Visualize the model's output interactively using Streamlit |
| `serve` | Run the NER API as a service to be accessed by TEI Publisher |
The following workflows are defined by the project. They can be executed using `spacy project run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
| Workflow | Steps |
| --- | --- |
| `all` | `clean` → `convert` → `create-config` → `train` |
| `dev` | `clean` → `convert.debug` → `create-config` → `train` |
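To make the ordering explicit, a workflow can be read as shorthand for a sequence of single-command invocations. The sketch below expands a workflow name into those command lines; the `WORKFLOWS` mapping mirrors the table above, while the `expand` helper is hypothetical and does not reproduce spaCy's skipping of unchanged steps.

```python
from typing import Dict, List

# Steps per workflow, copied from the workflow table.
WORKFLOWS: Dict[str, List[str]] = {
    "all": ["clean", "convert", "create-config", "train"],
    "dev": ["clean", "convert.debug", "create-config", "train"],
}


def expand(workflow: str) -> List[List[str]]:
    """Return one `spacy project run` command line per step, in execution order."""
    return [
        ["python3", "-m", "spacy", "project", "run", step]
        for step in WORKFLOWS[workflow]
    ]


for command in expand("dev"):
    print(" ".join(command))
```

In practice running `python3 -m spacy project run dev` is preferable, since spaCy then decides which steps can be skipped.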