DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
- 07/01/2025 - Our paper is now available on arXiv!
- 06/28/2025 - Dataset released on HuggingFace!
- 06/25/2025 - Initial public release of DICE-BENCH including data generation, scoring utilities, and vLLM inference scripts.
- 05/16/2025 - Our paper DICE-BENCH has been accepted to ACL 2025. See you in Vienna, Austria!
DICE-BENCH is a benchmark that tests how well large language models can call external functions in realistic group-chat scenarios.
Key points at a glance:
- DICE-BENCH synthesizes realistic group chats spanning one to four dialogue rounds with two to four speakers.
- The released dataset contains 1,607 dialogues and 124 distinct tools.
- DICE-SCORE measures input difficulty by quantifying how dispersed the tool-related clues are across the input; higher scores indicate harder inputs.
- Even GPT-4o averages only about 64 percent exact match, with performance falling as the number of rounds or participants increases (a toy scoring sketch follows this list).
- As the first benchmark to combine multi-round, multi-party dialogue with inter-tool dependencies, DICE-BENCH provides fully open code, data, and generation pipeline.
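For intuition only, here is a minimal sketch of how exact-match accuracy over predicted versus gold function calls could be computed. It is not the repository's official scorer (see `src/get_dice_score.py` and the evaluation pipeline below), and the field names `gold_call` / `predicted_call` are hypothetical.

```python
# Toy exact-match scorer: fraction of dialogues whose predicted function call
# string matches the gold call exactly. Field names are hypothetical — the
# real dataset schema may differ.
def exact_match(examples: list[dict]) -> float:
    if not examples:
        return 0.0
    hits = sum(
        1 for ex in examples
        if ex["predicted_call"].strip() == ex["gold_call"].strip()
    )
    return hits / len(examples)

# Example usage with dummy data
print(exact_match([
    {"gold_call": "book_taxi(time='18:00', seats=3)",
     "predicted_call": "book_taxi(time='18:00', seats=3)"},
    {"gold_call": "send_money(to='Alice', amount=50)",
     "predicted_call": "send_money(to='Alice', amount=40)"},
]))  # -> 0.5
```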
Path | Description |
---|---|
`src/` | Core Python package (agents, prompts, utils, graphs, inference) |
`data/` | Pre-generated sample datasets (`round_*.json`) |
`scripts/` | Bash helpers to generate data & run inference |
`outputs/` | Generated outputs (`all_rounds`, `selected_round`, `inf_vllm`). Note: the output files committed here are demo-sized samples only; please visit the Hugging Face repository for the full dataset. |
Script | Purpose | Key CLI flags / variables |
---|---|---|
`scripts/gen_all_round.sh` | Quickly generate a small dataset across rounds 1–4, multiple agent numbers & domains. | `AGENT_NUM_LIST`, `DOMAIN_LIST`, `ROUND_LIST`, `DATASET_NUM`; outputs to `outputs/all_rounds/round_<n>.json` |
`scripts/gen_selected_round.sh` | Generate many samples for one specific round (`SELECTED_ROUND`). | `DATASET_NUM`, `SELECTED_ROUND`; outputs to `outputs/selected_round/round_<n>.json` |
`scripts/inf_vllm.sh` | Run vLLM inference over generated dialogues. | `MODEL_NAME`, `FUNCTION_DOCS`, `MAX_TOKENS`; results in `outputs/inf_vllm/<model>/` |
All scripts rely on `uv` to launch Python modules reproducibly (`uv run ...`). Feel free to edit the variables at the top of each file.
The repository ships with a sample dataset under `data/sample/` so you can explore the JSON structure without running generation.
data/
├── round_1.json # full dataset (available at Huggingface)
├── round_2.json
├── ...
└── sample/
├── round_1.json # tiny subset (≈2 dialogues) for quick inspection
└── ...
- `round_<n>.json` – gold dialogues used for evaluation (can be regenerated).
- `sample/round_<n>.json` – miniature versions bundled with git to keep the repo lightweight.
The tool graph and function docs used during generation live in `src/graph/tool_graph.json` and `src/graph/tool_docs.json`, respectively.
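To peek at the bundled files, a quick sketch like the following can help. It only assumes these are standard JSON files and makes no assumptions about their internal schema:

```python
import json
from pathlib import Path

# Load the bundled sample dialogues and the tool graph/docs used for generation.
sample = json.loads(Path("data/sample/round_1.json").read_text())
tool_graph = json.loads(Path("src/graph/tool_graph.json").read_text())
tool_docs = json.loads(Path("src/graph/tool_docs.json").read_text())

# Print a rough overview without assuming a particular schema.
for name, obj in [("sample", sample), ("tool_graph", tool_graph), ("tool_docs", tool_docs)]:
    size = len(obj) if isinstance(obj, (list, dict)) else "n/a"
    print(f"{name}: type={type(obj).__name__}, top-level size={size}")
```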
`uv` is a super-fast, Rust-based package manager and virtual-env tool that fully understands `pyproject.toml`. If you do not have it yet:
curl -Ls https://astral.sh/uv/install.sh | bash # installs to ~/.cargo/bin/uv
Create the environment and install all dependencies with a single command:
# From repository root
uv sync              # creates .venv and installs deps from pyproject.toml
Need an extra library? Just do:
uv add <package-name>
Fallback: you can still use plain `pip`, but all examples below assume `uv`.
cd scripts
./gen_all_round.sh # all rounds, small size (≈ a few minutes)
./gen_selected_round.sh # generate many samples for a single round
Outputs are written under `outputs/all_rounds/` and `outputs/selected_round/`, respectively.
cd scripts
./inf_vllm.sh # requires CUDA + vLLM installation
Results will appear in `outputs/inf_vllm/<model_name>/`.
- Prepare Data - use the generation scripts above or supply your own tool-graph JSON.
- Fine-tune / Inference - leverage `src/inference/inference_vllm.py` for fast decoding.
- Evaluate - employ `src/get_dice_score.py` to calculate the DICE metric (a pipeline sketch follows below).
Detailed configs (model path, dataset size, TP degree, etc.) can be edited directly in each bash script or via CLI flags.
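For orientation, here is a minimal orchestration sketch that chains the three steps using the scripts and module paths shown above. The exact CLI arguments of `src/get_dice_score.py` are not documented here, so treat that step as an assumption and adjust it to your setup:

```python
import subprocess

# Each step: (command, working directory). The generation and inference
# scripts are normally invoked from scripts/ (see the Quick Start above);
# the scoring command below assumes no extra CLI flags, which may not hold.
steps = [
    (["./gen_all_round.sh"], "scripts"),                      # 1) prepare data
    (["./inf_vllm.sh"], "scripts"),                           # 2) run vLLM inference
    (["uv", "run", "python", "src/get_dice_score.py"], "."),  # 3) compute DICE metric
]

for cmd, cwd in steps:
    print(f"Running {' '.join(cmd)} (cwd={cwd})")
    subprocess.run(cmd, check=True, cwd=cwd)  # stop on the first failing step
```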
@misc{jang2025dicebenchevaluatingtoolusecapabilities,
title={DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues},
author={Kyochul Jang and Donghyeon Lee and Kyusik Kim and Dongseok Heo and Taewhoo Lee and Woojeong Kim and Bongwon Suh},
year={2025},
eprint={2506.22853},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.22853},
}
Questions / ideas? Open an issue or email [email protected]. Pull-requests are welcome!
Please visit kyochul[dot]com for more information about the first author!