- The queries used for our main experiments are sampled from 5 datasets and can be found at `data/all_data_latest_filtered.jsonl`.
- The autorater judgements for all three model pairs are available at `data/autorater_judgements`.
- The human judgements for all three model pairs are available at `data/human_judgements` (a quick inspection sketch follows this list).
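
A quick way to look at the released files (a minimal sketch; the exact fields in each record are best checked by printing one):

```bash
# Count the sampled queries (JSONL: one JSON object per line).
wc -l data/all_data_latest_filtered.jsonl

# Pretty-print the first record to see its fields.
head -n 1 data/all_data_latest_filtered.jsonl | python3 -m json.tool

# List the released judgement files for the three model pairs.
ls data/autorater_judgements data/human_judgements
```
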
- Download all the queries from all 5 datasets (ChatBot Arena, MTBench, AlpacaEval, ExpertQA, KIWI): `python3 main/download_data.py`
- Filter only well-formed queries and sample a fixed number of queries from each dataset: `python3 main/filter_queries.py` (both steps are shown together below).
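
Taken together, data preparation is these two commands run in order (a minimal sketch; any script-specific flags, e.g. sample sizes or output paths, are not shown here):

```bash
# 1. Download queries from all 5 source datasets.
python3 main/download_data.py

# 2. Keep only well-formed queries and sample a fixed number per dataset.
python3 main/filter_queries.py
```
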
- Example scripts for running context generation are at `bash_scripts/run_context_generation.sh` (a combined sketch of the three steps appears after this list).
- Generate contexts for all queries: `python3 main/generate_contexts.py`
- Generate validation labels for the generated contexts: `python3 main/generate_context_validation.py`
- Generate a single instance of context for each query (a follow-up question with a single answer): `python3 main/generate_single_context.py`
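
A minimal sketch of the context generation stage run end to end; the actual flags (model names, input/output paths, etc.) live in `bash_scripts/run_context_generation.sh` and are omitted here:

```bash
# Generate candidate contexts (follow-up QA pairs) for every query.
python3 main/generate_contexts.py

# Generate validation labels for the generated contexts.
python3 main/generate_context_validation.py

# Reduce each query's context to a single follow-up question with one answer.
python3 main/generate_single_context.py
```
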
- Example scripts for running response generation are at `bash_scripts/run_response_generation.sh`.
- Generate model responses with and without context: `python3 main/generate_responses.py`. Use `--w_context=False` for context-agnostic generation and `--w_context=True` for context-aware generation, as in the example below.
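
For example, the two response sets used in the experiments come from running the same script twice with the flag flipped (other flags follow `bash_scripts/run_response_generation.sh` and are omitted here):

```bash
# Context-agnostic responses: the model sees only the underspecified query.
python3 main/generate_responses.py --w_context=False

# Context-aware responses: the model also sees the generated context.
python3 main/generate_responses.py --w_context=True
```
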
- Example scripts for generating pairwise evaluation judgements are at `bash_scripts/run_eval_generation.sh`.
- For the setting `CtxGen-CtxEval`, you need to provide responses generated with context; for `NoCtxGen-NoCtxEval` and `NoCtxGen-CtxEval`, you need to provide responses generated without context. For `CtxGen-CtxEval` and `NoCtxGen-CtxEval`, set `W_CONTEXT=True`, while for `NoCtxGen-NoCtxEval`, set `W_CONTEXT=False` (summarized in the sketch after this list).
- Generate pairwise evaluation judgements using the following script: `python3 main/generate_pairwise_evals.py`
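
The mapping between evaluation settings, response inputs, and `W_CONTEXT` can be summarized as below. This is only a sketch of how `bash_scripts/run_eval_generation.sh` is parameterized; how `W_CONTEXT` is passed through to `main/generate_pairwise_evals.py` is defined in that script.

```bash
# CtxGen-CtxEval: provide responses generated with context (--w_context=True).
W_CONTEXT=True

# NoCtxGen-CtxEval: provide responses generated without context (--w_context=False).
W_CONTEXT=True

# NoCtxGen-NoCtxEval: provide responses generated without context (--w_context=False).
W_CONTEXT=False
```
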
- Compute win rates and agreement based on autorater judgements: `python3 main/compute_autorater_agreement.py`
- Compute win rates and agreement based on human judgements: `python3 main/compute_human_agreement.py`
- Example scripts for running the default response analysis are at `bash_scripts/default_response_analysis.sh`.
- Example scripts for running the adapted response analysis are at `bash_scripts/adapted_response_analysis.sh`.
- To generate the type of each query based on the degree / type of underspecification, use the script `main/generate_query_types.py`.
- To compute the number of constraints (follow-up QAs) satisfied by each response, use the script `main/eval_num_constraints.py`.
- To codify autorater justifications, use the script `main/code_model_judgements.py`, and to codify human justifications, use the script `main/code_human_judgements.py` (a combined sketch of these commands follows this list).
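
A minimal sketch of running these analyses back to back (each script may take additional arguments, e.g. input/output paths, that are omitted here):

```bash
# Categorize each query by its degree / type of underspecification.
python3 main/generate_query_types.py

# Count how many constraints (follow-up QAs) each response satisfies.
python3 main/eval_num_constraints.py

# Codify autorater and human justifications.
python3 main/code_model_judgements.py
python3 main/code_human_judgements.py
```
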
```bibtex
@article{malaviya2024contexteval,
  author  = {Malaviya, Chaitanya and Chee Chang, Joseph and Roth, Dan and Iyyer, Mohit and Yatskar, Mark and Lo, Kyle},
  title   = {Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2025},
  url     = {https://arxiv.org/abs/2411.07237}
}
```