This is a dataset for testing the coherence of LLMs following model edits with respect to the properties of categories and their members. For instance, one edit is "A Holstein is a kind of dog", and one test is "A sound a Holstein makes is bark" (originally "moo").
To build the datasets, just run:

```
python3 build-datasets.py
```
The resulting datasets can be loaded with pandas:

```python
import pandas as pd

edits_df = pd.read_json("datasets/edits.json")
baseline_df = pd.read_json("datasets/baseline-evaluation.json")
eval_df = pd.read_json("datasets/edits-evaluation.json")
```
The benchmark is multiple-choice, with at least two choices for every query.
Because causal language models predict left to right, the dataset distinguishes between "forward" and "reverse" queries. A "forward" query is one where the edited subject appears in the question prompt and an answer must be chosen. A "reverse" query is one where the edited subject is the answer itself.
- Forward: "A sound a Holstein makes is [bark / moo / tweet / hiss]"
- Reverse: "Bark is a sound made by a [Holstein / Labrador / Siamese / Owlet]"
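As an illustration, a forward query can be answered by comparing the log-probability of each choice as a completion of the prompt. The sketch below shows the general technique only, not the repository's implementation; it uses Hugging Face `transformers` with GPT-2 as a stand-in model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for illustration; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos - 1 predict the token at position pos.
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = "A sound a Holstein makes is"
choices = ["bark", "moo", "tweet", "hiss"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[scores.index(max(scores))])  # an edited model should prefer "bark"
```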
- `...-type-tokens.tsv`: the table above
- `...-data.tsv`: properties of animal types
- `build-datasets.py`: creates edit and benchmark-evaluation datasets (`baseline` for unedited models and `edits` for edited)
Dependencies:

- pandas
- numpy
- random
To run the benchmarks, just run:

```
python3 benchmark.py
```
The editor code is based on EasyEdit. I've added a custom submodule to EasyEdit with a few notable things:
- `EditedModel` class: uses `hparams` like other EasyEditor classes. Allows for a separation of editing and evaluating logic (see the usage sketch after this list).
  - `edit()`: edit the model with any method supported by EasyEdit. Also supports a direct implementation of a simple "IKE" method for in-context editing that prepends any prompt (e.g. "Imagine that ..."). Skips computation of metrics, unlike the EasyEditor classes.
  - `restore()`: restore the model to its unedited state
  - `generate_text(texts)`: generate text from the model (including with the IKE prompt)
  - `logprobs(texts)`: return logprobs of tokens
  - `substring_logprobs(texts, substring)`: return a list of logprobs for each occurrence of the substring's tokens
  - `completion_logbprobs(text, completion)`: return the logprob of a completion at the end of the text
  - `choose(prompt, choices, normalization=None)`: perform multiple choice. Returns an integer giving the index of the chosen item in the list `choices`. Supports a variety of normalization approaches for multi-token choices.
- `evaluate(evaluation_data, model)`: evaluate a model on a dataset
- `edit_and_evaluate(edits_df, eval_df, model, edit_method)`: edit the model based on `edits_df` and evaluate based on corresponding rows in `eval_df`, using `edit_method`
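For illustration, here is a hypothetical usage sketch. The constructor arguments, the edit-request format, and the `edit_method` value ("ROME") are assumptions based on EasyEdit conventions rather than the exact interface:

```python
# Hypothetical usage sketch: hparams construction and the edit-request format
# are assumed to follow EasyEdit conventions and may differ from the actual code.
model = EditedModel(hparams)  # hparams built as for other EasyEditor classes

# Apply one edit (request shape assumed to mirror EasyEdit edit requests).
model.edit({"prompt": "A Holstein is a kind of", "target_new": "dog", "subject": "Holstein"})

print(model.generate_text(["A sound a Holstein makes is"]))

# Multiple choice: returns the index of the best-scoring choice.
idx = model.choose("A sound a Holstein makes is", ["bark", "moo", "tweet", "hiss"])

model.restore()  # back to the unedited state

# Full benchmark: edit with each row of edits_df, evaluate on the corresponding
# rows of eval_df, and compare with the unedited baseline.
baseline_results = evaluate(baseline_df, model)
edited_results = edit_and_evaluate(edits_df, eval_df, model, edit_method="ROME")
```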
Create a `config.ini` with the following format:
```ini
[hugging_face]
token=YOUR_TOKEN_HERE
```
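As a minimal sketch, the token can then be read with Python's standard `configparser` (the repository's own loading code may differ):

```python
import configparser

# Read the Hugging Face token from config.ini (section and key as shown above).
config = configparser.ConfigParser()
config.read("config.ini")
hf_token = config["hugging_face"]["token"]
```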
See `environment.yml`.
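Assuming a conda environment is used, it can typically be created with:

```
conda env create -f environment.yml
```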