This is a dataset for testing the coherence of LLMs following model edits with respect to the properties of categories and their members. For instance, one edit is "A Holstein is a kind of dog", and one test is "A sound a Holstein makes is bark" (originally "moo").
To build the datasets, just run:

```
python3 build-datasets.py
```
The resulting datasets can be loaded with pandas:

```python
import pandas as pd

edits_df = pd.read_json("datasets/edits.json")
baseline_df = pd.read_json("datasets/baseline-evaluation.json")
eval_df = pd.read_json("datasets/edits-evaluation.json")
```
The benchmark is multiple-choice, with at least two choices for every query.
Because causal language models predict left to right, the dataset distinguishes between "forward" and "reverse" queries. A "forward" query is one where the edited subject appears in the question prompt and an answer must be chosen. A "reverse" query is one where the edited subject is the answer itself.
- Forward: "A sound a Holstein makes is [bark / moo / tweet / hiss]"
- Reverse: "Bark is a sound made by a [Holstein / Labrador / Siamese / Owlet]"
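As an illustration, a forward query can be answered by comparing the log-probability of each choice as a completion of the prompt. The sketch below shows the general technique only, not the repository's implementation; it uses Hugging Face `transformers` with GPT-2 as a stand-in model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for illustration; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos - 1 predict the token at position pos.
        total += logprobs[0, pos - 1, full_ids[0, pos]].item()
    return total

prompt = "A sound a Holstein makes is"
choices = ["bark", "moo", "tweet", "hiss"]
scores = [choice_logprob(prompt, c) for c in choices]
print(choices[scores.index(max(scores))])  # an edited model should prefer "bark"
```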
- `...-type-tokens.tsv`: the table above
- `...-data.tsv`: properties of animal types
- `build-datasets.py`: creates edit and benchmark-evaluation datasets (`baseline` for unedited models and `edits` for edited)
Dependencies:

- pandas
- numpy
- random
To run the benchmarks, just run:

```
python3 benchmark.py
```
The editor code is based on EasyEdit. I've added a custom submodule to EasyEdit with a few notable things:
- `EditedModel` class: uses `hparams` like other EasyEditor classes. Allows for a separation of editing and evaluating logic (see the usage sketch after this list).
  - `edit()`: edit the model with any method supported by EasyEdit. Also supports a direct implementation of a simple "IKE" method for in-context editing that prepends any prompt (e.g. "Imagine that ..."). Skips computation of metrics, unlike the EasyEditor classes.
  - `restore()`: restore the model to its unedited state
  - `generate_text(texts)`: generate text from the model (including with the IKE prompt)
  - `logprobs(texts)`: return logprobs of tokens
  - `substring_logprobs(texts, substring)`: return a list of logprobs for each occurrence of the substring's tokens
  - `completion_logbprobs(text, completion)`: return the logprob of a completion at the end of the text
  - `choose(prompt, choices, normalization=None)`: perform multiple choice. Returns an integer giving the index of the chosen item in the list `choices`. Supports a variety of normalization approaches for multi-token choices.
- `evaluate(evaluation_data, model)`: evaluate a model on a dataset
- `edit_and_evaluate(edits_df, eval_df, model, edit_method)`: edit the model based on `edits_df` and evaluate based on corresponding rows in `eval_df`, using `edit_method`
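For illustration, here is a hypothetical usage sketch. The constructor arguments, the edit-request format, and the `edit_method` value ("ROME") are assumptions based on EasyEdit conventions rather than the exact interface:

```python
# Hypothetical usage sketch: hparams construction and the edit-request format
# are assumed to follow EasyEdit conventions and may differ from the actual code.
model = EditedModel(hparams)  # hparams built as for other EasyEditor classes

# Apply one edit (request shape assumed to mirror EasyEdit edit requests).
model.edit({"prompt": "A Holstein is a kind of", "target_new": "dog", "subject": "Holstein"})

print(model.generate_text(["A sound a Holstein makes is"]))

# Multiple choice: returns the index of the best-scoring choice.
idx = model.choose("A sound a Holstein makes is", ["bark", "moo", "tweet", "hiss"])

model.restore()  # back to the unedited state

# Full benchmark: edit with each row of edits_df, evaluate on the corresponding
# rows of eval_df, and compare with the unedited baseline.
baseline_results = evaluate(baseline_df, model)
edited_results = edit_and_evaluate(edits_df, eval_df, model, edit_method="ROME")
```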
Create a `config.ini` with the following format:
```ini
[hugging_face]
token=YOUR_TOKEN_HERE
```
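As a minimal sketch, the token can then be read with Python's standard `configparser` (the repository's own loading code may differ):

```python
import configparser

# Read the Hugging Face token from config.ini (section and key as shown above).
config = configparser.ConfigParser()
config.read("config.ini")
hf_token = config["hugging_face"]["token"]
```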
See `environment.yml`.
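Assuming a conda environment is used, it can typically be created with:

```
conda env create -f environment.yml
```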