thu-coai/CharacterBench


CharacterBench: Benchmarking Character Customization of Large Language Models

🤗 Hugging Face • ⏬ Data • 📃 Paper

Data Preparation

  • Using the provided test set, instruct the evaluated large language model to play specific characters for generating responses.

  • These generated responses will then be evaluated by CharacterJudge in subsequent evaluations.

  • Make sure to update the model name (YOUR_MODEL_NAME) and the paths (data_path and output_path) as needed.

python process.py --data_path eval_data/raw_data --output_path eval_data/response_data --model_name YOUR_MODEL_NAME

  • Convert the generated responses into the input format expected by CharacterJudge.

cd construct_prompts
python process_wo_context_zh_all.py --data_path ../eval_data/response_data --output_path ../eval_data/evaluation_data_zh --model_name YOUR_MODEL_NAME
python process_wo_context_en_all.py --data_path ../eval_data/response_data --output_path ../eval_data/evaluation_data_en --model_name YOUR_MODEL_NAME
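The three preparation commands above can also be driven from a small Python wrapper. The sketch below is illustrative only: `pipeline_commands` and `run_pipeline` are hypothetical helpers, not part of this repository, and it assumes you run it from the repository root (so the `construct_prompts` scripts are invoked with repo-root-relative paths instead of the `../` paths shown above).

```python
import subprocess

MODEL = "YOUR_MODEL_NAME"  # replace with the name of the model being evaluated


def pipeline_commands(model: str) -> list[list[str]]:
    """Build the data-preparation command sequence from the README, in order."""
    return [
        # 1) Generate character responses from the raw test set.
        ["python", "process.py",
         "--data_path", "eval_data/raw_data",
         "--output_path", "eval_data/response_data",
         "--model_name", model],
        # 2) Convert responses into CharacterJudge's Chinese input format.
        ["python", "construct_prompts/process_wo_context_zh_all.py",
         "--data_path", "eval_data/response_data",
         "--output_path", "eval_data/evaluation_data_zh",
         "--model_name", model],
        # 3) Convert responses into CharacterJudge's English input format.
        ["python", "construct_prompts/process_wo_context_en_all.py",
         "--data_path", "eval_data/response_data",
         "--output_path", "eval_data/evaluation_data_en",
         "--model_name", model],
    ]


def run_pipeline(model: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in pipeline_commands(model):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(MODEL)  # dry run: prints the commands without executing them
```

Set `dry_run=False` once the paths and model name are in place; `check=True` stops the pipeline if any step fails.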

Evaluation

  • Run CharacterJudge to generate the evaluation results.

bash run_zh.sh YOUR_MODEL_NAME
bash run_en.sh YOUR_MODEL_NAME

Citation

If you find our work useful for your research, please cite our paper as follows:

@inproceedings{DBLP:conf/aaai/ZhouHWBCK0XPTZZ25,
  author       = {Jinfeng Zhou and
                  Yongkang Huang and
                  Bosi Wen and
                  Guanqun Bi and
                  Yuxuan Chen and
                  Pei Ke and
                  Zhuang Chen and
                  Xiyao Xiao and
                  Libiao Peng and
                  Kuntian Tang and
                  Rongsheng Zhang and
                  Le Zhang and
                  Tangjie Lv and
                  Zhipeng Hu and
                  Hongning Wang and
                  Minlie Huang},
  editor       = {Toby Walsh and
                  Julie Shah and
                  Zico Kolter},
  title        = {CharacterBench: Benchmarking Character Customization of Large Language
                  Models},
  booktitle    = {AAAI-25, Sponsored by the Association for the Advancement of Artificial
                  Intelligence, February 25 - March 4, 2025, Philadelphia, PA, {USA}},
  pages        = {26101--26110},
  publisher    = {{AAAI} Press},
  year         = {2025},
  url          = {https://doi.org/10.1609/aaai.v39i24.34806},
  doi          = {10.1609/AAAI.V39I24.34806},
  timestamp    = {Thu, 17 Apr 2025 17:08:58 +0200},
  biburl       = {https://dblp.org/rec/conf/aaai/ZhouHWBCK0XPTZZ25.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Contact Us

If you have any feedback on our work, please feel free to contact us at ✉️ [email protected].

