The code in this repository is used to train and apply a Named Entity Recognition (NER) model that detects informal references to datasets in academic literature. The labeled data are derived from the ICPSR Bibliography of Data-Related Literature and the Semantic Scholar Open Research Corpus. This analysis supports the paper "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature."
- Demonstration notebook of NER model applied to a paper
- Training workflow for spaCy NER model using labeled data
- NER model training parameters
Each dataset consists of sentences from academic articles and is named for the source from which it is derived. The training data were labeled, merged, and exported from Prodigy as of May 10, 2022, for use in spaCy, using the following recipes:
- prodigy db-in dataset_name /path/to/_data.jsonl
- prodigy ner.manual dataset_name --label DATASET
- prodigy data-to-spacy train --ner bibliography,paperpile,s2orc
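
The `db-in` recipe above imports a JSONL file, that is, one JSON object per line, where each object carries the sentence under a `"text"` key. As a minimal sketch of how such a file could be prepared, the following Python snippet writes a list of sentences in that shape. The function names, the output filename, and the example sentences are hypothetical and are not taken from this repository.

```python
import json
from pathlib import Path


def write_prodigy_jsonl(sentences, path):
    """Write one JSON object per line, each with a "text" key.

    Hypothetical helper for illustration; not part of this repository.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for sentence in sentences:
            text = sentence.strip()
            # Skip blank entries so every record has text to annotate.
            if text:
                handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Made-up sentences, used only to show the file layout.
    examples = [
        "We analyzed responses from a national household survey.",
        "Data were drawn from a longitudinal study of adolescents.",
    ]
    out = write_prodigy_jsonl(examples, "example_data.jsonl")
    for record in read_jsonl(out):
        print(record["text"])
```

A file written this way could then be passed as the second argument to `prodigy db-in`, in place of the path shown in the first recipe.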