This directory contains the datasets and scripts for an example project using Prodigy to train a binary text classifier with exclusive classes to predict whether a GitHub issue title is about documentation.
We've limited our experiments to spaCy, but you can use the annotations in any other text classification system instead. If you run the experiments, please let us know! Feel free to submit a pull request with your scripts.
| Model | F-Score | # Examples |
| --- | --- | --- |
| spaCy blank | 88.8 | 661 |
| spaCy `en_vectors_web_lg` | 91.9 | 661 |
Labelling the data with Prodigy took about two hours and was done manually using the binary classification interface. The raw text was sourced from the GitHub API using the search queries `"docs"`, `"documentation"`, `"readme"` and `"instructions"`.
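For reference, issue titles like these can be fetched from GitHub's public issue search endpoint. The snippet below is a minimal sketch of that kind of collection, not the exact script used for this dataset; the query parameters, pagination handling and output file name are illustrative.

```python
import json
import requests

QUERIES = ["docs", "documentation", "readme", "instructions"]

with open("raw_text.jsonl", "w", encoding="utf8") as f:
    for query in QUERIES:
        # GitHub's issue search API; unauthenticated requests are rate-limited
        resp = requests.get(
            "https://api.github.com/search/issues",
            params={"q": query, "per_page": 100},
        )
        resp.raise_for_status()
        for item in resp.json()["items"]:
            # One JSON object per line, matching Prodigy's JSONL input format
            f.write(json.dumps({"text": item["title"]}) + "\n")
```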
| File | Count | Description |
| --- | --- | --- |
| `docs_issues_training.jsonl` | 661 | Training data annotated with `DOCUMENTATION` label. |
| `docs_issues_eval.jsonl` | 500 | Evaluation data annotated with `DOCUMENTATION` label. |
The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a `"text"`, the `"label"` and an `"answer"` (`"accept"` if the label applies, `"reject"` if it doesn't apply). Here are two simplified example entries:
```json
{
    "text": "Add FAQ's to the documentation",
    "label": "DOCUMENTATION",
    "answer": "accept"
}
{
    "text": "Proposal: deprecate SQTagUtil.java",
    "label": "DOCUMENTATION",
    "answer": "reject"
}
```
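To train with spaCy, these binary accept/reject answers need to end up in the `"cats"` dictionary that spaCy's text classifier expects. The helper below is a minimal sketch of that conversion (the function name and the treatment of `"reject"` as a negative example are our assumptions; `scripts_spacy.py` is the canonical reference):

```python
import json

def load_prodigy_annotations(path):
    """Hypothetical helper: convert Prodigy's accept/reject answers
    into (text, annotations) pairs for spaCy's text classifier."""
    examples = []
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            # "accept" means the label applies, "reject" means it doesn't
            cats = {eg["label"]: eg["answer"] == "accept"}
            examples.append((eg["text"], {"cats": cats}))
    return examples

train_data = load_prodigy_annotations("docs_issues_training.jsonl")
```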
The data was annotated with the following command, using Prodigy's `mark` recipe with the classification interface:

```bash
prodigy mark docs_issues_data ./raw_text.jsonl --label DOCUMENTATION --view-id classification
```
We also trained a model using Allen AI's Autocat app, a web-based tool for training, visualizing and showcasing spaCy text classification models. You can try out the classifier in real time and see the predictions update as you type. You can also evaluate it on your own data, download the model Python package, or just `pip install` it with one command to try it locally. View the model here.
To use the JSONL data in Autocat, we added `"labels": ["DOCUMENTATION"]` to all examples with `"answer": "accept"` and `"labels": ["N/A"]` to all examples with `"answer": "reject"`.
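That conversion is a one-liner per record. A minimal sketch (the output file name `autocat_training.jsonl` is our choice):

```python
import json

with open("docs_issues_training.jsonl", encoding="utf8") as f_in, \
        open("autocat_training.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        eg = json.loads(line)
        # Accepted examples keep the label; rejected ones get the "N/A" class
        eg["labels"] = ["DOCUMENTATION"] if eg["answer"] == "accept" else ["N/A"]
        f_out.write(json.dumps(eg) + "\n")
```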
The `scripts_spacy.py` file includes command line scripts for training and evaluating spaCy models using the data in Prodigy's format. This should let you reproduce our results. We tried to keep the scripts as straightforward as possible. To see the available arguments, you can run `python scripts_spacy.py [command] --help`.
| Command | Description |
| --- | --- |
| `train` | Train a model from Prodigy annotations. Will optionally save the best model to disk. |
| `evaluate` | Evaluate a trained model on Prodigy annotations and print the accuracy. |
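Once trained and saved to disk, the model loads like any other spaCy pipeline, and the predictions for the `DOCUMENTATION` label end up in `doc.cats`. A short sketch (the `./model` path is illustrative and assumes you used the train command's save option):

```python
import spacy

nlp = spacy.load("./model")  # directory the trained model was saved to
doc = nlp("Fix typos in the README")
# doc.cats maps each label to a score between 0 and 1
print(doc.cats["DOCUMENTATION"])
```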