TEI Publisher Named Entity Recognition API

This repository contains the API that TEI Publisher's web-annotation editor uses to detect entities in the input text, as well as scripts to train entity recognition models. Named entity recognition is based on spaCy and Python.

An article describing the main features has been published on e-editiones.org.

The project:

serves the Named Entity Recognition API which is accessed by TEI Publisher to enrich TEI documents with auto-detected entities
provides scripts to train new models based on training data extracted from existing TEI documents in TEI Publisher

Installation

Note: as this feature is still under development, you need the (not yet released) version 8 or the master branch of TEI Publisher, which integrates the NER API into the web-annotation editor.

Install dependencies by running:

pip3 install -r requirements.txt

Download one or more trained spaCy pipelines, e.g. for German:

python3 -m spacy download de_core_news_sm
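Pipelines downloaded with spacy download are installed as regular Python packages, so one way to confirm that a download succeeded is to check whether the package can be found. The following is a minimal sketch; the helper name missing_pipelines is hypothetical and not part of this repository.

```python
from importlib.util import find_spec


def missing_pipelines(names):
    """Return the pipeline package names that are not installed."""
    # spaCy pipelines such as de_core_news_sm are importable packages,
    # so find_spec() returns None when one is absent.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_pipelines(["de_core_news_sm"])
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All requested pipelines are installed.")
```

This only checks that the package exists; it does not load the pipeline or verify that it is compatible with the installed spaCy version.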

Starting the Named Entity Recognition Service

We're using a spaCy project setup to orchestrate the different services and workflow steps. The setup is configured in project.yml, where you can change various variables. It also defines various commands and workflows. They can be executed using python3 -m spacy project run [name]. Commands are only re-run if their inputs have changed.

To start the Named Entity Recognition (NER) Service, run the following command:

python3 -m spacy project run serve

By default the service will listen on port 8001, which corresponds to the port TEI Publisher has configured. If you now open a document in TEI Publisher's annotation editor (or reload the browser window if you had one open), you should see that an additional button is enabled at the bottom right of the toolbar. This indicates that TEI Publisher was able to communicate with the NER service.
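If the button does not appear, a quick first check is whether anything is listening on the expected port at all. The sketch below assumes the default host localhost and port 8001 mentioned above; the function name port_is_open is hypothetical.

```python
import socket


def port_is_open(host="localhost", port=8001, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (connection refused, timeout, ...)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A True result only shows that some process accepts connections on the port, not that it is the NER service, so it is best treated as a first diagnostic step.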

How Does it Work?

Whenever a user runs automatic entity detection:

TEI Publisher extracts the plain text of a TEI document, remembering the original position of each text fragment within the TEI XML
The plain text is sent to the /entities endpoint of the named entity recognition API, which returns a JSON array of the entities found
TEI Publisher re-maps each received entity back to its position in the original TEI XML and creates an annotation, which is inserted into the web annotation editor
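The position bookkeeping in the steps above can be sketched as follows. This illustrates the general idea only and is not TEI Publisher's actual code: the fragment identifiers, the helper names, and the assumption that an entity is reported with start and end character offsets into the plain text are all hypothetical.

```python
from bisect import bisect_right


def flatten(fragments):
    """Join (node_id, text) fragments into one plain-text string.

    Also returns the start offset of every fragment, so positions in the
    plain text can be traced back to their fragment later.
    """
    starts, parts, pos = [], [], 0
    for _node_id, text in fragments:
        starts.append(pos)
        parts.append(text)
        pos += len(text)
    return "".join(parts), starts


def remap(offset, fragments, starts):
    """Map a plain-text offset to (node_id, offset within that fragment)."""
    # Pick the last fragment that starts at or before the offset.
    i = bisect_right(starts, offset) - 1
    return fragments[i][0], offset - starts[i]


def remap_span(start, end, fragments, starts):
    """Map an entity span (end exclusive) to its start and end positions."""
    end_node, end_offset = remap(end - 1, fragments, starts)
    return remap(start, fragments, starts), (end_node, end_offset + 1)


# Hypothetical text nodes of a TEI paragraph, keyed by made-up identifiers.
fragments = [
    ("p1/text1", "Letter from "),
    ("p1/hi/text1", "Karl"),
    ("p1/text2", " Barth in Basel."),
]
plain, starts = flatten(fragments)

# Hypothetical entity with offsets into `plain`; the real response format
# is described in the API documentation.
entity = {"start": 12, "end": 22, "label": "PER"}
print(plain[entity["start"]:entity["end"]])
print(remap_span(entity["start"], entity["end"], fragments, starts))
```

The example entity deliberately spans two fragments, which is the case that makes the original positions worth remembering: the annotation has to start in one text node and end in another.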

API Documentation

You can view the API documentation here: http://localhost:8001/docs
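The /docs path matches the default location of the interactive documentation generated by FastAPI. If the service is built that way (an assumption, since this README does not say so), a machine-readable OpenAPI schema is normally served at /openapi.json as well. The sketch below lists the operations found in such a schema; both function names are hypothetical.

```python
import json
from urllib.request import urlopen

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(schema):
    """Return sorted (METHOD, path) pairs from an OpenAPI schema dict."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations)


def fetch_schema(base_url="http://localhost:8001"):
    """Download the schema; the /openapi.json location is an assumption."""
    with urlopen(f"{base_url}/openapi.json", timeout=5) as response:
        return json.load(response)
```

With the service running, list_operations(fetch_schema()) would return the available endpoints, which should include the /entities endpoint described above.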

Training a Model

The default models provided by spaCy perform well on simple texts in modern language, but may not produce adequate results on your particular edition. You may therefore want to train a model on a sample collection of texts you have compiled. This requires TEI documents that have already been semantically enriched with entity markup, e.g. by annotating them manually in TEI Publisher's annotation editor.

You can train a new model:

using TEI Publisher's web interface
via the command line

The first step is the same for both approaches: store the compiled sample documents in a collection below TEI Publisher's data collection (or reuse the existing annotate collection). The easiest way is to use eXide to create a sub-collection (below /db/apps/tei-publisher/data) and upload the compiled documents to it.

Using TEI Publisher's Web Interface

As a logged-in user, open the train-ner.html page in TEI Publisher directly or navigate to it via the admin menu.

Training via the Command Line

To train a model:

make sure that the variable training_collection in project.yml points to the sample collection you chose
run the all workflow to start the training

python3 -m spacy project run all

This will:

contact TEI Publisher's API endpoint to extract sample data from the documents in the collection. The sample data is essentially a list of text blocks and the position of entities occurring in those blocks.
convert the received sample data into the binary format required by spaCy
start the actual training
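Because the sample data consists of text blocks plus entity positions, it can be worth sanity-checking those positions before conversion: spaCy does not accept overlapping spans as entities of a document. The sketch below assumes a hypothetical shape for one block, a dict with a "text" string and an "entities" list of (start, end, label) tuples; the format actually returned by TEI Publisher may differ.

```python
def find_span_problems(block):
    """Return human-readable problems found in one sample block.

    Assumed (hypothetical) shape:
    {"text": str, "entities": [(start, end, label), ...]}
    """
    text = block["text"]
    problems = []
    previous_end = 0
    for start, end, label in sorted(block["entities"]):
        # Spans must be non-empty and lie inside the text.
        if not 0 <= start < end <= len(text):
            problems.append(f"{label} {start}-{end}: outside the text or empty")
            continue
        # Sorted by start, so an overlap shows up as starting too early.
        if start < previous_end:
            problems.append(f"{label} {start}-{end}: overlaps the previous entity")
        previous_end = max(previous_end, end)
    return problems


block = {
    "text": "Karl Barth lived in Basel.",
    "entities": [(0, 10, "PER"), (20, 25, "LOC")],
}
print(find_span_problems(block))
```

The check command listed below is the project's own way of inspecting the created training data; this sketch only illustrates the kind of inconsistency that can appear in span-based annotations.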

The result will be a new model stored in the models subdirectory. Now restart the NER service; the new model should be picked up by TEI Publisher and offered for selection in the models dropdown within the annotation editor.

List of Commands

The following commands are defined by the project. They can be executed using spacy project run [name]. Commands are only re-run if their inputs have changed.

Command	Description
`clean`	Remove auxiliary files and directories
`cleanall`	Remove auxiliary files and directories
`download`	Download a spaCy model with pretrained vectors
`convert`	Convert the data to spaCy's binary format
`convert.debug`	Convert the data to spaCy's binary format, also dumping the input JSON data into the training data output directory
`check`	Check the created training data sets
`create-config`	Create a new config with an NER pipeline component
`create-config-update`	Create a config, which updates the NER component of an existing pipeline, but keeps all other components
`train`	Train the NER model
`train-with-vectors`	Train the NER model with vectors
`evaluate`	Evaluate the model and export metrics
`package`	Package the trained model as a pip package
`visualize-model`	Visualize the model's output interactively using Streamlit
`serve`	Run the NER API as a service to be accessed by TEI Publisher

Workflows

The following workflows are defined by the project. They can be executed using spacy project run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

Workflow	Steps
`all`	`clean` → `convert` → `create-config` → `train`
`dev`	`clean` → `convert.debug` → `create-config` → `train`
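The rule that commands are only re-run if their inputs have changed can be pictured as a checksum comparison between runs. The following is a simplified illustration of that idea, not spaCy's implementation; the function names and the in-memory lock dictionary are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum(paths):
    """Hash the names and contents of all input files in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(map(str, paths)):
        digest.update(path.encode("utf-8"))
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def needs_rerun(name, inputs, lock):
    """Return True, and record the new state, if the inputs of `name` changed."""
    current = checksum(inputs)
    if lock.get(name) == current:
        return False  # same inputs as last time: the command can be skipped
    lock[name] = current
    return True
```

In this picture, running a workflow twice in a row would execute each command the first time and skip it the second time, unless one of its input files was modified in between.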
