RepoMod-Bench evaluates AI agents' ability to translate software projects between languages/frameworks while maintaining functional equivalence. It supports:
- 21 benchmarks across CLI tools and REST APIs
- 8 programming languages: C, C++, Go, Java, JavaScript, Python, Rust, TypeScript
- 1.6M lines of code total, with repositories ranging from 14 to 211K LOC
- 11,616 test cases for implementation-agnostic evaluation
Each benchmark has:
- `workspace/src/` - Source implementation (given to agent)
- `workspace/dst/` - Target implementation (agent generates this)
- `workspace/prompt.md` - Translation instructions
- `tests/` - Hidden pytest tests (only used during evaluation)
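The layout above can be sanity-checked with a short script; a minimal sketch, assuming a benchmark checkout at a hypothetical `benchmarks/hello-world-api` path:

```python
from pathlib import Path

def check_benchmark_layout(bench_dir: str) -> list[str]:
    """Return the expected benchmark paths missing under bench_dir."""
    expected = [
        "workspace/src",        # source implementation (given to agent)
        "workspace/dst",        # target implementation (agent writes here)
        "workspace/prompt.md",  # translation instructions
        "tests",                # hidden pytest tests, used only at evaluation
    ]
    root = Path(bench_dir)
    return [p for p in expected if not (root / p).exists()]

# Report anything missing from a (hypothetical) benchmark checkout
print(check_benchmark_layout("benchmarks/hello-world-api"))
```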
```bash
# Clone the repository
git clone <repo-url>
cd mcode-benchmark

# Create virtual environment (requires Python 3.12+)
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Copy environment template and add your API key
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY

# Copy config template and select benchmarks to run
cp config.toml.example config.toml
# Edit config.toml and uncomment desired benchmark IDs
```

Requirements:
- Docker and Docker Compose
- Python 3.12+
- uv (recommended) or pip
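For illustration, enabling benchmarks in `config.toml` might look like the fragment below; this is a hypothetical sketch (the real key names come from `config.toml.example`):

```toml
# Hypothetical fragment - see config.toml.example for the real schema
ids = [
    "hello-world-api",
    "charcoal-cli",
    # "jq-gojq",   # commented-out IDs are skipped
]
```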
```bash
# Set API key in .env or environment
echo 'ANTHROPIC_API_KEY=sk-...' > .env

# Configure benchmarks in config.toml

# Run agent (default timeout: 3600s)
python3 run_agent.py --timeout 3600
```

This automatically:
- Starts Docker container
- Pipes prompt.md to Claude headless
- Agent writes code to workspace/dst/
- Logs saved to logs/agent-/
```bash
# Start container for a specific benchmark
./dev.sh charcoal-cli

# Inside container, run any agent manually:
cd /workspace
IS_SANDBOX=true cat prompt.md | claude -p --dangerously-skip-permissions

# Or use other tools (aider, cursor, etc.)
# Exit shell when done - container stops automatically
```

This gives full interactive control inside the dev environment at `/workspace`.
```bash
# Configure which benchmarks to run in config.toml
# Edit the ids array to select benchmarks (see table below for IDs)

# Test source only (default)
python3 run_benchmarks.py --test src

# Test destination only
python3 run_benchmarks.py --test dst

# Test both
python3 run_benchmarks.py --test both
```

Results are saved to `results/results.jsonl` and logs to `logs/`.
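The JSONL results can be post-processed with a few lines of Python; a sketch only, assuming each line carries (hypothetical) `benchmark` and `passed` keys rather than the file's actual schema:

```python
import json
from collections import Counter

def summarize(results_path: str) -> Counter:
    """Count pass/fail entries per benchmark in a JSONL results file.

    Assumes each line is a JSON object with hypothetical keys
    "benchmark" and "passed"; adjust to the real schema.
    """
    counts: Counter = Counter()
    with open(results_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            counts[(rec["benchmark"], "pass" if rec["passed"] else "fail")] += 1
    return counts
```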
For reproducible experiments that combine agent runs with testing, use `run_experiment.py`:
Create a YAML config file (e.g., `my_experiment.yml`):

```yaml
name: my-experiment
agent: claude-code        # Agent from agents.yml
benchmarks:               # Benchmarks to run
  - hello-world-api
  - task-management-api
  - charcoal-cli
timeout: 3600             # Timeout per iteration (seconds)
parallel: 3               # Number of parallel workers
template_version: v3      # Prompt template version (optional)
description: "My experiment description"
```

Available agents (defined in `agents.yml`):
- `claude-code` - Claude Code CLI with Claude Opus 4.5
- `codex-cli` - OpenAI Codex CLI with GPT-5.2
- `opencode-claude` - OpenCode with Claude Opus 4.5
- `opencode-openai` - OpenCode with GPT-5.2
- `gemini-cli` - Gemini CLI with Gemini 3 Flash
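An `agents.yml` entry ties an agent ID to a CLI invocation and model. The fragment below is purely illustrative; the field names `command` and `model` are assumptions, not the file's actual schema:

```yaml
# Illustrative sketch - see agents.yml for the real schema
claude-code:
  command: "claude -p --dangerously-skip-permissions"  # as in the manual-run example above
  model: claude-opus-4-5                               # hypothetical model identifier
```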
```bash
python3 run_experiment.py my_experiment.yml
```

This:
- Creates an `experiments/<timestamp>_<name>/` directory
- Copies workspaces for isolation
- Runs agent on each benchmark
- Runs tests automatically
- Saves results to `summary.jsonl`
For experiments that measure improvement over multiple iterations:
```yaml
name: iteration-experiment
agent: claude-code
benchmarks:
  - jq-gojq
  - charcoal-cli
iterations: 5             # Run 5 iterations per benchmark
test_each_iteration: true # Test after each iteration (for accuracy curves)
timeout: 10800            # Higher timeout for multi-iteration
```

The resulting directory layout:

```
experiments/
└── 20260115_120000_my-experiment/
    ├── experiment.yml          # Config snapshot
    ├── agents.yml              # Agent config snapshot
    ├── summary.jsonl           # Results for all benchmarks
    ├── hello-world-api/
    │   ├── workspace/          # Isolated workspace copy
    │   └── agent_log.jsonl     # Agent execution log
    └── task-management-api/
        └── ...
```
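With `test_each_iteration` enabled, per-iteration results in `summary.jsonl` can be turned into accuracy curves. A sketch only: the `benchmark`, `iteration`, and `pass_rate` keys below are assumptions, not the file's actual schema:

```python
import json
from collections import defaultdict

def accuracy_curves(summary_path: str) -> dict[str, list[float]]:
    """Group (hypothetical) pass_rate values by benchmark, ordered by iteration."""
    points = defaultdict(list)
    with open(summary_path) as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            points[rec["benchmark"]].append((rec["iteration"], rec["pass_rate"]))
    # Sort each benchmark's points by iteration, keep only the pass rates
    return {b: [rate for _, rate in sorted(p)] for b, p in points.items()}
```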
The following experiment configs in experiments/ were used to generate paper results:
| Config File | Agent | Description |
|---|---|---|
| `opencode-claude-full.yml` | OpenCode | All benchmarks with Claude Opus 4.5 |
| `opencode-openai-full.yml` | OpenCode | All benchmarks with GPT-5.2 |
| `test-v3-claude.yml` | Claude Code | Test runs with v3 template |
| `test-v3-codex.yml` | Codex CLI | Test runs with v3 template |
| `iteration-n5-all-per-iter.yml` | Claude Code | Multi-iteration experiment (5 iterations) |
| Benchmark | Source | Target | LOC | Tests | Description |
|---|---|---|---|---|---|
| hello-world-api | Python | Java | 14 | 6 | Minimal REST API |
| task-mgmt-api | Go | Python | 762 | 10 | CRUD task manager |
| bcal | C | Go | 2.5K | 73 | Byte calculator |
| tokei | Rust | Go | 10.1K | 196 | Code statistics tool |
| toml | Go | Python | 13.9K | 647 | TOML parser |
| charcoal-cli | TypeScript | Python | 15.8K | 195 | Git workflow tool |
| httpie-xh | Python | Rust | 19.2K | 101 | HTTP client |
| jmespath | Go | Rust | 19.3K | 888 | JSON query language |
| gitleaks | Go | Rust | 22.4K | 35 | Secret scanner |
| ledger | C++ | Go | 50.0K | 483 | Accounting |
| wabt | C++ | Rust | 54.8K | 433 | WebAssembly toolkit |
| taskwarrior | C++ | Rust | 55.3K | 912 | Task management |
| lightningcss | Rust | Go | 61.7K | 1,779 | CSS parser/minifier |
| bc | C | Rust | 117K | 1,938 | Precision calculator |
| hugo | Go | Rust | 122K | 74 | Static site generator |
| jq-gojq | C | Go | 147K | 430 | JSON processor |
| pdfcpu | Go | Rust | 160K | 141 | PDF processor |
| uncrustify | C++ | Rust | 162K | 2,024 | Code beautifier |
| prettier | JavaScript | Rust | 175K | 539 | Code formatter |
| verible | C++ | Rust | 191K | 148 | SystemVerilog tools |
| qalculate | C++ | Go | 211K | 564 | Math calculator |
Total: 1.6M LOC, 11,616 tests
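The totals can be double-checked mechanically; the per-benchmark figures below are copied from the table above (K suffixes expanded):

```python
# Test counts per benchmark, copied from the table above
tests = [6, 10, 73, 196, 647, 195, 101, 888, 35, 483, 433, 912,
         1779, 1938, 74, 430, 141, 2024, 539, 148, 564]
# LOC per benchmark, also from the table
loc = [14, 762, 2_500, 10_100, 13_900, 15_800, 19_200, 19_300, 22_400,
       50_000, 54_800, 55_300, 61_700, 117_000, 122_000, 147_000,
       160_000, 162_000, 175_000, 191_000, 211_000]

assert len(tests) == len(loc) == 21           # 21 benchmarks
assert sum(tests) == 11_616                   # matches "11,616 tests"
assert round(sum(loc) / 1_000_000, 1) == 1.6  # matches "1.6M LOC"
```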
Use Claude Code commands to streamline the process:
```bash
# Step 1: Evaluate if a repo meets selection criteria
/evaluate-benchmark https://github.com/user/repo

# Step 2: If approved, add it as a benchmark
/add-benchmark https://github.com/user/repo <target-language>
```

For detailed criteria and manual process, see:
- `SELECTING_BENCHMARKS.md` - Selection criteria
- `ADDING_BENCHMARKS.md` - Manual setup instructions
See REPRODUCE.md for instructions on:
- Regenerating paper results (Tables 1-3)
- Verifying benchmark statistics (LOC, test counts)
If you use RepoMod-Bench in your research, please cite: Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, and Antoine Raux. 2026. RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26).
We welcome contributions that expand the diversity and scale of RepoMod-Bench. By submitting a pull request, you agree that your contributions will be licensed under the project's Apache License 2.0. When adding repositories, make sure to respect the upstream licenses.