- The queries used for our main experiments are sampled from 5 datasets and can be found at `data/all_data_latest_filtered.jsonl`.
- The autorater judgements for all three model pairs are available at `data/autorater_judgements`.
- The human judgements for all three model pairs are available at `data/human_judgements` (a quick inspection sketch follows this list).
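
A quick way to look at the released files (a minimal sketch; the exact fields in each record are best checked by printing one):

```bash
# Count the sampled queries (JSONL: one JSON object per line).
wc -l data/all_data_latest_filtered.jsonl

# Pretty-print the first record to see its fields.
head -n 1 data/all_data_latest_filtered.jsonl | python3 -m json.tool

# List the released judgement files for the three model pairs.
ls data/autorater_judgements data/human_judgements
```
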
- Download all the queries from all 5 datasets (ChatBot Arena, MTBench, AlpacaEval, ExpertQA, KIWI): `python3 main/download_data.py`
- Filter only well-formed queries and sample a fixed number of queries from each dataset: `python3 main/filter_queries.py` (both steps are shown together below).
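
Taken together, data preparation is these two commands run in order (a minimal sketch; any script-specific flags, e.g. sample sizes or output paths, are not shown here):

```bash
# 1. Download queries from all 5 source datasets.
python3 main/download_data.py

# 2. Keep only well-formed queries and sample a fixed number per dataset.
python3 main/filter_queries.py
```
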
- Example scripts for running context generation are at `bash_scripts/run_context_generation.sh` (a combined sketch of the three steps appears after this list).
- Generate contexts for all queries: `python3 main/generate_contexts.py`
- Generate validation labels for the generated contexts: `python3 main/generate_context_validation.py`
- Generate a single instance of context for each query (a follow-up question with a single answer): `python3 main/generate_single_context.py`
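
A minimal sketch of the context generation stage run end to end; the actual flags (model names, input/output paths, etc.) live in `bash_scripts/run_context_generation.sh` and are omitted here:

```bash
# Generate candidate contexts (follow-up QA pairs) for every query.
python3 main/generate_contexts.py

# Generate validation labels for the generated contexts.
python3 main/generate_context_validation.py

# Reduce each query's context to a single follow-up question with one answer.
python3 main/generate_single_context.py
```
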
- Example scripts for running response generation are at `bash_scripts/run_response_generation.sh`.
- Generate model responses with and without context: `python3 main/generate_responses.py`. Use `--w_context=False` for context-agnostic generation and `--w_context=True` for context-aware generation, as in the example below.
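
For example, the two response sets used in the experiments come from running the same script twice with the flag flipped (other flags follow `bash_scripts/run_response_generation.sh` and are omitted here):

```bash
# Context-agnostic responses: the model sees only the underspecified query.
python3 main/generate_responses.py --w_context=False

# Context-aware responses: the model also sees the generated context.
python3 main/generate_responses.py --w_context=True
```
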
- Example scripts for generating pairwise evaluation judgements are at `bash_scripts/run_eval_generation.sh`.
- For the setting `CtxGen-CtxEval`, you need to provide responses generated with context; for `NoCtxGen-NoCtxEval` and `NoCtxGen-CtxEval`, you need to provide responses generated without context. For `CtxGen-CtxEval` and `NoCtxGen-CtxEval`, set `W_CONTEXT=True`, while for `NoCtxGen-NoCtxEval`, set `W_CONTEXT=False` (summarized in the sketch after this list).
- Generate pairwise evaluation judgements using the following script: `python3 main/generate_pairwise_evals.py`
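
The mapping between evaluation settings, response inputs, and `W_CONTEXT` can be summarized as below. This is only a sketch of how `bash_scripts/run_eval_generation.sh` is parameterized; how `W_CONTEXT` is passed through to `main/generate_pairwise_evals.py` is defined in that script.

```bash
# CtxGen-CtxEval: provide responses generated with context (--w_context=True).
W_CONTEXT=True

# NoCtxGen-CtxEval: provide responses generated without context (--w_context=False).
W_CONTEXT=True

# NoCtxGen-NoCtxEval: provide responses generated without context (--w_context=False).
W_CONTEXT=False
```
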
- Compute win rates and agreement based on autorater judgements: `python3 main/compute_autorater_agreement.py`
- Compute win rates and agreement based on human judgements: `python3 main/compute_human_agreement.py`
- Example scripts for running the default response analysis are at `bash_scripts/default_response_analysis.sh`.
- Example scripts for running the adapted response analysis are at `bash_scripts/adapted_response_analysis.sh`.
- To generate the type of each query based on the degree / type of underspecification, use the script `main/generate_query_types.py`.
- To compute the number of constraints (follow-up QAs) satisfied by each response, use the script `main/eval_num_constraints.py`.
- To codify autorater justifications, use the script `main/code_model_judgements.py`, and to codify human justifications, use the script `main/code_human_judgements.py` (a combined sketch of these commands follows this list).
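
A minimal sketch of running these analyses back to back (each script may take additional arguments, e.g. input/output paths, that are omitted here):

```bash
# Categorize each query by its degree / type of underspecification.
python3 main/generate_query_types.py

# Count how many constraints (follow-up QAs) each response satisfies.
python3 main/eval_num_constraints.py

# Codify autorater and human justifications.
python3 main/code_model_judgements.py
python3 main/code_human_judgements.py
```
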
```bibtex
@article{malaviya2024contexteval,
  author  = {Malaviya, Chaitanya and Chee Chang, Joseph and Roth, Dan and Iyyer, Mohit and Yatskar, Mark and Lo, Kyle},
  title   = {Contextualized Evaluations: Judging Language Model Responses to Underspecified Queries},
  journal = {Transactions of the Association for Computational Linguistics},
  year    = {2025},
  url     = {https://arxiv.org/abs/2411.07237}
}
```