dataset-references

This repository contains the code used to train and apply a Named Entity Recognition (NER) model that detects informal references to datasets in academic literature. The labeled data are derived from the ICPSR Bibliography of Data-Related Literature and the Semantic Scholar Open Research Corpus (S2ORC). This work supports the paper "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature."

code/ner-demo.ipynb

Demonstration notebook that applies the NER model to a paper

code/spacy-ner.ipynb

Training workflow for spaCy NER model using labeled data

config.cfg

NER model training parameters

data/

Each dataset contains sentences from academic articles and is named for the source from which it was derived. Training data were labeled, merged, and exported from Prodigy (as of May 10, 2022) for use in spaCy with the following recipes:

prodigy db-in dataset_name /path/to/_data.jsonl
prodigy ner.manual dataset_name --label DATASET
prodigy data-to-spacy train --ner bibliography,paperpile,s2orc
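
To make the shape of the labeled data more concrete, here is a minimal sketch in Python. It assumes each line of a JSONL file follows Prodigy's usual NER task layout: a "text" field holding the sentence, a "spans" list with character offsets and a label, and an "answer" field. The sample records, the SAMPLE_JSONL name, and the extract_mentions helper are hypothetical and are not taken from this repository's data or notebooks.

```python
import json

# Hypothetical sample records (not from this repository), one JSON object
# per line, with character-offset spans labeled DATASET.
SAMPLE_JSONL = """\
{"text": "We use data from the General Social Survey.", "spans": [{"start": 21, "end": 42, "label": "DATASET"}], "answer": "accept"}
{"text": "Results are discussed in the next section.", "spans": [], "answer": "accept"}
"""


def extract_mentions(jsonl_text, label="DATASET"):
    """Return (mention, sentence) pairs for accepted spans with the given label."""
    mentions = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: records without an "answer" field count as accepted.
        if record.get("answer", "accept") != "accept":
            continue
        text = record["text"]
        for span in record.get("spans", []):
            if span["label"] == label:
                mentions.append((text[span["start"]:span["end"]], text))
    return mentions


for mention, sentence in extract_mentions(SAMPLE_JSONL):
    print(f"{mention!r} in: {sentence}")
```

Reading mentions back out of the character offsets this way is a quick check that the spans line up with the sentence text before the data are converted for spaCy.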
