
DICE-BENCH Banner

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues


✨ News (DICE-BENCH)

  • 07/01/2025 - Our paper is now available on arXiv!
  • 06/28/2025 - Dataset released on HuggingFace!
  • 06/25/2025 - Initial public release of DICE-BENCH including data generation, scoring utilities, and vLLM inference scripts.
  • 05/16/2025 - Our paper DICE-BENCH has been accepted to ACL 2025. See you in Vienna, Austria!

📖 Overview

DICE-BENCH Data Generation Overview

DICE-BENCH is a benchmark that tests how well large language models can call external functions in realistic group-chat scenarios.

Key points at a glance:

  • DICE-BENCH synthesizes realistic group chats that span up to four dialogue rounds and involve two to four speakers.
  • The released dataset contains 1,607 dialogues and 124 distinct tools.
  • DICE-SCORE quantifies input difficulty by measuring how widely tool-related clues are dispersed across the dialogue; higher scores mean harder inputs (a toy illustration follows this list).
  • Even GPT-4o averages only about 64 percent exact match, with performance falling as rounds or participants increase.
  • As the first benchmark to combine multi-round, multi-party dialogue with inter-tool dependencies, DICE-BENCH ships with fully open code, data, and generation pipeline.
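
As a rough intuition for the dispersion idea only (this is not the paper's DICE-SCORE formula), the sketch below scores a dialogue by how spread out the turns containing tool clues are; the clue turn indices and the normalization are purely hypothetical.

# Toy dispersion-style difficulty score -- NOT the official DICE-SCORE formula.
# Clue turn indices and the normalization below are illustrative assumptions.
from statistics import pstdev

def toy_dispersion_score(clue_turns: list[int], num_turns: int) -> float:
    """Grows toward 1 as tool clues spread across more of the dialogue."""
    if num_turns <= 1 or len(clue_turns) < 2:
        return 0.0
    # Spread of clue positions, normalized by half the dialogue length.
    return pstdev(clue_turns) / (num_turns / 2)

print(toy_dispersion_score([3, 3, 4], num_turns=12))   # clues bunched together -> low (easier)
print(toy_dispersion_score([1, 6, 11], num_turns=12))  # clues scattered -> high (harder)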

DICE-BENCH Baseline Comparison


📂 Directory Layout

| Path | Description |
| --- | --- |
| src/ | Core Python package (agents, prompts, utils, graphs, inference) |
| data/ | Pre-generated sample datasets (round_*.json) |
| scripts/ | Bash helpers to generate data & run inference |
| outputs/ | Generated outputs (all_rounds, selected_round, inf_vllm). Note: the output files committed here are demo-sized samples only; see the Hugging Face repository for the full dataset. |

🛠️ Core Scripts

| Script | Purpose | Key CLI flags / variables |
| --- | --- | --- |
| scripts/gen_all_round.sh | Quickly generate a small dataset across rounds 1–4, multiple agent numbers & domains. | AGENT_NUM_LIST, DOMAIN_LIST, ROUND_LIST, DATASET_NUM; outputs to outputs/all_rounds/round_<n>.json |
| scripts/gen_selected_round.sh | Generate many samples for one specific round (SELECTED_ROUND). | DATASET_NUM, SELECTED_ROUND; outputs to outputs/selected_round/round_<n>.json |
| scripts/inf_vllm.sh | Run vLLM inference over generated dialogues. | MODEL_NAME, FUNCTION_DOCS, MAX_TOKENS; results in outputs/inf_vllm/<model>/ |

All scripts rely on uv to launch Python modules reproducibly (uv run ...). Feel free to edit the variables at the top of each file.


📁 Data Directory Explained

The repository ships with a sample dataset under data/sample/ so you can explore the JSON structure without running generation.

 data/
   ├── round_1.json          # full dataset (available on Hugging Face)
   ├── round_2.json
   ├── ...
   └── sample/
        ├── round_1.json     # tiny subset (≈2 dialogues) for quick inspection
        └── ...
  • round_<n>.json – gold dialogues used for evaluation (can be regenerated).
  • sample/round_<n>.json – miniature versions bundled with git to keep the repo lightweight.

The tool graph and function docs used during generation live in src/graph/tool_graph.json and src/graph/tool_docs.json respectively.
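
To peek at the bundled sample data and the tool docs without guessing the schema, a minimal sketch such as the one below works; the assumption that each JSON file is a list of records mirrors the sample layout and may need adjusting.

import json
from pathlib import Path

# Load the tiny sample split shipped with the repository.
sample_path = Path("data/sample/round_1.json")
dialogues = json.loads(sample_path.read_text())   # assumed: a JSON list of dialogue records
print(f"{len(dialogues)} sample dialogues in {sample_path}")
print(sorted(dialogues[0].keys()))                # inspect the actual per-dialogue schema

# Tool docs used during generation.
tool_docs = json.loads(Path("src/graph/tool_docs.json").read_text())
print(f"{len(tool_docs)} entries in src/graph/tool_docs.json")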


🏃‍♂️ Quick Start

1. Environment (🛠 with uv)

uv is a super-fast Rust-based package manager & virtual-env tool that fully understands pyproject.toml. If you do not have it yet:

curl -Ls https://astral.sh/uv/install.sh | bash   # installs to ~/.cargo/bin/uv

Create the environment and install all dependencies with a single command:

# From repository root
uv sync             # creates .venv and installs deps from pyproject.toml

Need an extra library? Just do:

uv add <package-name>

Fallback: you can still use plain pip, but all examples below assume uv.

2. Generate Synthetic Dialogues

cd scripts
./gen_all_round.sh       # all rounds, small size (≈ a few minutes)
./gen_selected_round.sh  # generate many samples for a single round

Outputs are written under outputs/all_rounds/ and outputs/selected_round/ respectively.
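
After generation, a quick sanity check like this confirms how many dialogues landed in each round file; as above, treating each file as a JSON list of records is an assumption.

import json
from pathlib import Path

# Count generated dialogues per round (assumes each round_<n>.json is a JSON list).
for path in sorted(Path("outputs/all_rounds").glob("round_*.json")):
    records = json.loads(path.read_text())
    print(f"{path.name}: {len(records)} dialogues")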

3. Run vLLM Inference

cd scripts
./inf_vllm.sh            # requires CUDA + vLLM installation

Results will appear in outputs/inf_vllm/<model_name>/.
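
If you want to drive vLLM directly rather than through scripts/inf_vllm.sh, the minimal offline-decoding sketch below shows the general shape; it is not the repository's src/inference/inference_vllm.py, and the model name, prompt text, and sampling settings are placeholders.

# Minimal vLLM offline decoding sketch -- not the repo's inference script.
# Model name, prompt text, and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # any HF model vLLM can serve
params = SamplingParams(temperature=0.0, max_tokens=256)

# In the real pipeline the prompt would contain the multi-party dialogue plus function docs.
prompts = [
    "You are a tool-calling assistant. Given the group chat below, "
    "output the function call that fulfils the users' request.\n\n<dialogue here>"
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)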


🔄 Experiment Steps

  1. Prepare Data - use the generation scripts above or supply your own tool-graph JSON.
  2. Fine-tune / Inference - leverage src/inference/inference_vllm.py for fast decoding.
  3. Evaluate - employ src/get_dice_score.py to calculate the DICE-SCORE metric.

Detailed configs (model path, dataset size, TP degree, etc.) can be edited directly in each bash script or via CLI flags.
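
Since the headline results are reported as exact match, the sketch below shows one way to compare a predicted function call against the gold call; it is an illustration only, not the logic of src/get_dice_score.py, and the (name, arguments) call format is assumed.

# Illustration of exact-match scoring over function calls.
# NOT src/get_dice_score.py; the (name, arguments) call format is an assumption.
def exact_match(pred: dict, gold: dict) -> bool:
    """True if the predicted call names the same tool with identical arguments."""
    return (
        pred.get("name") == gold.get("name")
        and pred.get("arguments", {}) == gold.get("arguments", {})
    )

preds = [{"name": "book_flight", "arguments": {"dest": "VIE", "date": "2025-07-28"}}]
golds = [{"name": "book_flight", "arguments": {"dest": "VIE", "date": "2025-07-28"}}]
accuracy = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
print(f"exact match: {accuracy:.2%}")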


📜 Citation

@misc{jang2025dicebenchevaluatingtoolusecapabilities,
  title={DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues},
  author={Kyochul Jang and Donghyeon Lee and Kyusik Kim and Dongseok Heo and Taewhoo Lee and Woojeong Kim and Bongwon Suh},
  year={2025},
  eprint={2506.22853},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.22853},
}

🤝 Contact & Contributing

Questions / ideas? Open an issue or email [email protected]. Pull requests are welcome!

Please visit kyochul[dot]com for more information about the first author.
