Modelcode-ai/mcode-benchmark

RepoMod-Bench

1. Overview

RepoMod-Bench evaluates the ability of AI agents to translate software projects between languages and frameworks while maintaining functional equivalence. The suite comprises:

  • 21 benchmarks across CLI tools and REST APIs
  • 8 programming languages: C, C++, Go, Java, JavaScript, Python, Rust, TypeScript
  • 1.6M lines of code total, with repositories ranging from 14 to 211K LOC
  • 11,616 test cases for implementation-agnostic evaluation

Each benchmark has:

  • workspace/src/ - Source implementation (given to agent)
  • workspace/dst/ - Target implementation (agent generates this)
  • workspace/prompt.md - Translation instructions
  • tests/ - Hidden pytest tests (only used during evaluation)
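Because the hidden tests are implementation-agnostic, they exercise the built program as a black box rather than inspecting its internals. A minimal sketch of what such a pytest-style check might look like (the `run_cli` helper and the example `--sort-keys` behavior are hypothetical illustrations, not the benchmark's actual harness):

```python
import json
import subprocess

def run_cli(binary, args, stdin=""):
    """Run an implementation (src or dst) as a black box; return (exit code, stdout)."""
    result = subprocess.run(
        [binary, *args], input=stdin, capture_output=True, text=True
    )
    return result.returncode, result.stdout

def check_sort_keys(binary):
    # The same assertion holds for any correct implementation,
    # regardless of the language it is written in.
    code, out = run_cli(binary, ["--sort-keys"], stdin='{"b":1,"a":2}')
    assert code == 0
    assert json.loads(out) == {"a": 2, "b": 1}
```

The same test file can then be pointed at either `workspace/src/` or `workspace/dst/` builds, which is what makes source and destination directly comparable.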

2. Setup

# Clone the repository
git clone <repo-url>
cd mcode-benchmark

# Create virtual environment (requires Python 3.12+)
uv venv
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Copy environment template and add your API key
cp .env.example .env
# Edit .env and add ANTHROPIC_API_KEY

# Copy config template and select benchmarks to run
cp config.toml.example config.toml
# Edit config.toml and uncomment desired benchmark IDs
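The `ids` array in config.toml selects which benchmarks run. A hypothetical fragment (the surrounding structure is illustrative; check config.toml.example for the actual schema):

```toml
# config.toml - uncomment the benchmark IDs you want to run
ids = [
    "hello-world-api",
    # "charcoal-cli",
    # "jq-gojq",
]
```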

Requirements:

  • Docker and Docker Compose
  • Python 3.12+
  • uv (recommended) or pip

3. Running an Agent to Generate the Destination

Option A: Automated via run_agent.py

# Set API key in .env or environment
echo 'ANTHROPIC_API_KEY=sk-...' > .env

# Configure benchmarks in config.toml
# Run agent (default timeout: 3600s)
python3 run_agent.py --timeout 3600

This automatically:

  1. Starts the Docker container
  2. Pipes prompt.md into Claude in headless mode
  3. Lets the agent write its code to workspace/dst/
  4. Saves logs to logs/agent-/
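In essence, the automated flow assembles a container invocation and streams the prompt into the agent's stdin. A rough sketch of the command construction (the image name, mount layout, and flags are assumptions for illustration, not run_agent.py's actual code):

```python
# Sketch: build the docker command that runs the agent headlessly.
# The image name "mcode-benchmark" and the mount layout are assumptions.
def build_agent_command(benchmark_dir, timeout=3600):
    return [
        "docker", "run", "--rm", "-i",
        "-v", f"{benchmark_dir}/workspace:/workspace",
        "-w", "/workspace",
        "mcode-benchmark",  # hypothetical image name
        "timeout", str(timeout),
        "claude", "-p", "--dangerously-skip-permissions",
    ]

cmd = build_agent_command("benchmarks/charcoal-cli")
# The prompt is then piped in, conceptually:
# subprocess.run(cmd, stdin=open("benchmarks/charcoal-cli/workspace/prompt.md"))
```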

Option B: Interactive via dev.sh

# Start container for a specific benchmark
./dev.sh charcoal-cli

# Inside container, run any agent manually:
cd /workspace
cat prompt.md | IS_SANDBOX=true claude -p --dangerously-skip-permissions

# Or use other tools (aider, cursor, etc.)
# Exit shell when done - container stops automatically

This gives full interactive control inside the dev environment at /workspace.

4. Running Tests Against Implementations

# Configure which benchmarks to run in config.toml
# Edit the ids array to select benchmarks (see table below for IDs)

# Test source only (default)
python3 run_benchmarks.py --test src

# Test destination only
python3 run_benchmarks.py --test dst

# Test both
python3 run_benchmarks.py --test both

Results are saved to results/results.jsonl and logs to logs/.
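Since results.jsonl is line-delimited JSON, per-benchmark pass rates are easy to pull out with a few lines of Python. A small sketch (the `benchmark`, `passed`, and `total` field names are assumptions about the record schema, not the file's documented format):

```python
import json

def summarize(path):
    """Return {benchmark: pass_rate} from a JSONL results file.

    Assumes each line looks like:
    {"benchmark": "bcal", "passed": 70, "total": 73}
    """
    rates = {}
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            rates[rec["benchmark"]] = rec["passed"] / rec["total"]
    return rates
```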

5. Running Experiments (Reproducible)

For reproducible experiments that combine agent runs with testing, use run_experiment.py:

Experiment Configuration

Create a YAML config file (e.g., my_experiment.yml):

name: my-experiment
agent: claude-code          # Agent from agents.yml
benchmarks:                 # Benchmarks to run
  - hello-world-api
  - task-management-api
  - charcoal-cli
timeout: 3600               # Timeout per iteration (seconds)
parallel: 3                 # Number of parallel workers
template_version: v3        # Prompt template version (optional)
description: "My experiment description"

Available agents (defined in agents.yml):

  • claude-code - Claude Code CLI with Claude Opus 4.5
  • codex-cli - OpenAI Codex CLI with GPT-5.2
  • opencode-claude - OpenCode with Claude Opus 4.5
  • opencode-openai - OpenCode with GPT-5.2
  • gemini-cli - Gemini CLI with Gemini 3 Flash

Run Experiment

python3 run_experiment.py my_experiment.yml

This:

  1. Creates experiments/<timestamp>_<name>/ directory
  2. Copies workspaces for isolation
  3. Runs agent on each benchmark
  4. Runs tests automatically
  5. Saves results to summary.jsonl
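The experiment directory name combines a timestamp with the config's `name` field. A sketch of that naming convention (the exact format string is an assumption inferred from the example path in the output structure, and the real code may differ):

```python
from datetime import datetime

def experiment_dir(name, now=None):
    """Derive a directory name such as experiments/20260115_120000_my-experiment.

    The YYYYMMDD_HHMMSS format is inferred from the documented
    output structure; treat it as illustrative.
    """
    now = now or datetime.now()
    return f"experiments/{now:%Y%m%d_%H%M%S}_{name}"
```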

Multi-Iteration Experiments

For experiments that measure improvement over multiple iterations:

name: iteration-experiment
agent: claude-code
benchmarks:
  - jq-gojq
  - charcoal-cli
iterations: 5              # Run 5 iterations per benchmark
test_each_iteration: true  # Test after each iteration (for accuracy curves)
timeout: 10800             # Higher timeout for multi-iteration

Output Structure

experiments/
└── 20260115_120000_my-experiment/
    ├── experiment.yml       # Config snapshot
    ├── agents.yml           # Agent config snapshot
    ├── summary.jsonl        # Results for all benchmarks
    ├── hello-world-api/
    │   ├── workspace/       # Isolated workspace copy
    │   └── agent_log.jsonl  # Agent execution log
    └── task-management-api/
        └── ...

Experiment Configs Used in Paper

The following experiment configs in experiments/ were used to generate paper results:

| Config file | Agent | Description |
|---|---|---|
| opencode-claude-full.yml | OpenCode | All benchmarks with Claude Opus 4.5 |
| opencode-openai-full.yml | OpenCode | All benchmarks with GPT-5.2 |
| test-v3-claude.yml | Claude Code | Test runs with v3 template |
| test-v3-codex.yml | Codex CLI | Test runs with v3 template |
| iteration-n5-all-per-iter.yml | Claude Code | Multi-iteration experiment (5 iterations) |

Available Benchmarks

| Benchmark | Source | Target | LOC | Tests | Description |
|---|---|---|---|---|---|
| hello-world-api | Python | Java | 14 | 6 | Minimal REST API |
| task-mgmt-api | Go | Python | 762 | 10 | CRUD task manager |
| bcal | C | Go | 2.5K | 73 | Byte calculator |
| tokei | Rust | Go | 10.1K | 196 | Code statistics tool |
| toml | Go | Python | 13.9K | 647 | TOML parser |
| charcoal-cli | TypeScript | Python | 15.8K | 195 | Git workflow tool |
| httpie-xh | Python | Rust | 19.2K | 101 | HTTP client |
| jmespath | Go | Rust | 19.3K | 888 | JSON query language |
| gitleaks | Go | Rust | 22.4K | 35 | Secret scanner |
| ledger | C++ | Go | 50.0K | 483 | Accounting |
| wabt | C++ | Rust | 54.8K | 433 | WebAssembly toolkit |
| taskwarrior | C++ | Rust | 55.3K | 912 | Task management |
| lightningcss | Rust | Go | 61.7K | 1,779 | CSS parser/minifier |
| bc | C | Rust | 117K | 1,938 | Precision calculator |
| hugo | Go | Rust | 122K | 74 | Static site generator |
| jq-gojq | C | Go | 147K | 430 | JSON processor |
| pdfcpu | Go | Rust | 160K | 141 | PDF processor |
| uncrustify | C++ | Rust | 162K | 2,024 | Code beautifier |
| prettier | JavaScript | Rust | 175K | 539 | Code formatter |
| verible | C++ | Rust | 191K | 148 | SystemVerilog tools |
| qalculate | C++ | Go | 211K | 564 | Math calculator |

Total: 1.6M LOC, 11,616 tests
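The stated totals can be checked directly against the table; for instance, a quick sanity check of the test counts (the numbers below are copied from the table above):

```python
# Per-benchmark test counts, transcribed from the benchmark table.
test_counts = {
    "hello-world-api": 6, "task-mgmt-api": 10, "bcal": 73,
    "tokei": 196, "toml": 647, "charcoal-cli": 195,
    "httpie-xh": 101, "jmespath": 888, "gitleaks": 35,
    "ledger": 483, "wabt": 433, "taskwarrior": 912,
    "lightningcss": 1779, "bc": 1938, "hugo": 74,
    "jq-gojq": 430, "pdfcpu": 141, "uncrustify": 2024,
    "prettier": 539, "verible": 148, "qalculate": 564,
}
total_tests = sum(test_counts.values())
print(total_tests)  # 11616, matching the stated total
```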


Selecting and Adding New Benchmarks

Use Claude Code commands to streamline the process:

# Step 1: Evaluate if a repo meets selection criteria
/evaluate-benchmark https://github.com/user/repo

# Step 2: If approved, add it as a benchmark
/add-benchmark https://github.com/user/repo <target-language>

For detailed criteria and manual process, see:


Reproducing Results

See REPRODUCE.md for instructions on:

  • Regenerating paper results (Tables 1-3)
  • Verifying benchmark statistics (LOC, test counts)

Citation

If you use RepoMod-Bench in your research, please cite: Xuefeng Li, Nir Ben-Israel, Yotam Raz, Belal Ahmed, Doron Serebro, and Antoine Raux. 2026. RepoMod-Bench: A Benchmark for Code Repository Modernization via Implementation-Agnostic Testing. In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '26).

Contributing to RepoMod-Bench

We welcome contributions that expand the diversity and scale of RepoMod-Bench. By submitting a pull request, you agree that your contributions will be licensed under the project's Apache License 2.0. When adding repositories, make sure to respect the upstream licenses.
