We present EnterpriseBench, a new commercially grounded benchmark designed to evaluate the capabilities of AI agents in solving real-world software engineering tasks.
Addressing limitations in existing benchmarks, we introduce two versions: one based on SWE-bench methodology, featuring a curated set of high-quality tasks, and another employing a test-driven development (TDD) paradigm with 147 tasks across three repositories. Tasks originate from authentic enterprise Jira tickets and cover diverse issue types, including bug fixes and feature implementations. Visual task elements are transformed into textual descriptions using multimodal models. To improve experimentation efficiency, we propose a novel cost-efficient strategy based on early agent-model pair selection using a limited set of repositories. Additionally, we introduce an experimental stub-project methodology and accompanying data to assess agent performance in complex pipeline construction, offering a stripped-down project skeleton with matching tickets and tests. The benchmark was tested on state-of-the-art AI coding agents. Our dataset is unique in its exclusive use of proprietary commercial data, preventing answer leakage and ensuring non-contamination of current LLM training sets.
# Clone the framework
$ git clone https://github.com/exadel-inc/EnterpriseBench.git
$ cd EnterpriseBench
Tool | Version | Notes |
---|---|---|
Ubuntu | 20.04.6 LTS (tested) | Other OSes are not yet supported |
Java Development Kit (JDK) | AuthoringToolKit (JDK 8), CompreFace (JDK 17), DynamicMailboxes (JDK 11) | Set `JVM_DIR` to the JDK home if it is not on your `PATH` |
Maven | Apache Maven 3.6.3 | Used to build the target repo |
Python | 3.12 | Required for the orchestration scripts |
Git | latest | Required for checking out historical commits |
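A quick sanity check of the toolchain before running anything; exact version strings will vary by machine:
$ java -version        # expect JDK 8, 11, or 17, depending on the target repo
$ mvn -version         # expect Apache Maven 3.6.3
$ python3 --version    # expect Python 3.12.x
$ git --version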
🗒️ Note: When working with the AuthoringToolKit repository you must force the benchmark to use Java 8 by adding `--java-major 8` to the command line of both `4_run_all_tickets.py` and any direct calls to `3_run_ticket_test.py`.
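For instance, a full AuthoringToolKit run might look like this (the project path assumes the Dataverse layout produced by `prepare_dataverse.sh`, described below):
$ python3 4_run_all_tickets.py --project-root dataverse_files/AuthoringToolKit --java-major 8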
🗒️ Simply run `utils/install_dependencies.sh` to install the required system dependencies, and `utils/prepare_dataverse.sh` to download the Harvard Dataverse archive if needed; the latter creates the correct folder layout and renames/unpacks everything exactly as required. A combined example follows the script descriptions below.
The `install_dependencies.sh` script will:
- Ensure bash, curl, git, and unzip are installed.
- Install Apache Maven 3.6.3 if `mvn` is missing.
- Install Java SDKs 8, 11, and 17 (on Ubuntu/Debian; other distros print a hint).
- Ensure pip3 is available (installing `python3-pip` if missing).
- Install the pandas Python package.
The `prepare_dataverse.sh` script will:
- Download the Harvard Dataverse archive (DOI 10.7910/DVN/S4WOTJ) if `dataverse_files.zip` is not already present.
- Extract the archive into a clean `dataverse_files/` directory, unless that folder already exists and is non-empty, in which case the script skips all extraction and rename work.
- Inside every project subfolder it
  • renames `*.csv` → `pr_states.csv`
  • unpacks `patches_neg*` / `patches_pos*` ZIPs into flat `patches_neg/` and `patches_pos/` folders
  • unzips the main repo archive into a flat `project_repo/` folder
  • creates a `jvm` symlink pointing to `/usr/lib/jvm` (so the benchmark finds all installed JDKs).
🗒️ Re-running the script is idempotent: it detects an existing `dataverse_files/` directory and exits without touching your data or reinstalling the JDKs.
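Taken together, a typical one-time setup from the repository root looks like this (assuming a Debian/Ubuntu host, per the notes above):
$ bash utils/install_dependencies.sh   # system packages, Maven, JDKs 8/11/17, pip3, pandas
$ bash utils/prepare_dataverse.sh      # downloads and lays out the Dataverse archive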
After the script finishes, point `--project-root` at one of the unpacked project sub-folders (e.g., `dataverse_files/CompreFace`) and jump straight to the Running the Benchmark section.
EnterpriseBench expects the following artefacts for every benchmark run:
- `project_root` – the root directory of the benchmark project; pass it to `4_run_all_tickets.py` via the `--project-root` argument.
- `pr_states.csv` – the mapping between issue/ticket IDs and the commit SHA(s) that resolved them.
- `project_repo/` – the full Git history of the benchmark project.
- `patches_neg/` – negative git diff patches.
- `patches_ai/` – AI agent git diff patches (default; can be overridden via `--ai-patches-dir`).
🗒️ Rename / copy your dataset file to `pr_states.csv` (e.g. `dataset_CF_anonymized.csv → pr_states.csv`). The scripts look for that exact filename by default.
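For example, from inside the project root:
$ cp dataset_CF_anonymized.csv pr_states.csv   # keep the original; the scripts only look for pr_states.csv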
The directory layout with necessary files should look like this:
project_root/
├── pr_states.csv
├── project_repo/                 # cloned target project
├── patches_neg/
│   ├── <ticket1>_non_test.diff
│   └── ...
└── patches_ai/
    ├── <patch_set1>/
    │   ├── <ticket1>_non_test.diff
    │   └── ...
    └── <patch_set2>/
        ├── <ticket1>_non_test.diff
        └── ...
$ python3 4_run_all_tickets.py --project-root dataverse_files/CompreFace
The following commands apply your AI-generated patch sets to each of the three benchmark projects that ship in the Harvard Dataverse archive. Point the `--ai-patches-dir` argument at the directory that contains your `<ticket>_non_test.diff` files; if the flag is omitted, the script defaults to the `patches_ai` directory within the project root. The script also supports multiple AI patch sets: if the patch directory contains subfolders, each is treated as a distinct patch set and processed separately (see the sketch below).
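A minimal sketch of such a multi-set layout; the patch-set folder names here are hypothetical:
patches_ai/
├── agent_run_1/
│   ├── MM-62925_non_test.diff
│   └── ...
└── agent_run_2/
    ├── MM-62925_non_test.diff
    └── ...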
$ python3 4_run_all_tickets.py --ai --project-root dataverse_files/CompreFace
Place the golden patches in the `patches_pos/` directory under the project root (e.g., `dataverse_files/CompreFace/patches_pos`).
$ python3 3_run_ticket_test.py MM-62925 patches_pos/MM-62925_non_test.diff
Results from each run are saved in the `test_results.csv` file and in the `results/` directory. The helper script below summarizes and displays results from benchmark runs.
$ python3 5_measure_scores.py dataverse_files/CompreFace
# 1) AuthoringToolKit - this repo must be built with Java 8
python3 4_run_all_tickets.py \
--project-root dataverse_files/AuthoringToolKit \
--java-major 8 \
--ai \
--ai-patches-dir PATCHES_EAK_TDD_DEEPSEEK_mSWE_AGENT_CL2
# 2) CompreFace
python3 4_run_all_tickets.py \
--project-root dataverse_files/CompreFace \
--ai \
--ai-patches-dir PATCHES_CF_classic_GPT_4o_MINI_mSWE_AGENT_CL_1
# 3) DynamicMailboxes
python3 4_run_all_tickets.py \
--project-root dataverse_files/DynamicMailboxes \
--ai \
--ai-patches-dir PATCHES_DMB_classic_GPT_4o_MINI_mSWE_AGENT_CL_1
Flag | Purpose | Default |
---|---|---|
`TICKET` | PR ticket ID to test (positional) | required |
`PATCH` | Optional diff file (`<ticket>_non_test.diff`) | none |
`--ai` | Skip base + merge stages; run only the negative/code stage | off |
`--project-root PATH` | Root of the benchmark project | script's folder |
`--java-major N` | Force Java major version (e.g., 8, 17) | highest JDK found |
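For example, a single-ticket run against CompreFace might look like this (the ticket ID and patch path follow the golden-patch example above; adjust them to your project):
$ python3 3_run_ticket_test.py MM-62925 patches_pos/MM-62925_non_test.diff \
    --project-root dataverse_files/CompreFace \
    --java-major 17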
Flag | Purpose | Default |
---|---|---|
`--ai` | Run only the AI-patch stage (skips base + merge) | off |
`--ai-patches-dir PATH` | Directory containing `<ticket>_non_test.diff` files; if omitted, the script defaults to the `patches_ai` directory within the project root | `patches_ai` in the project root |
`--project-root PATH` | Root of the benchmark project | script's folder |
`--java-major N` | Force Java major version (e.g., 8, 17) | highest JDK found |
Argument | Purpose | Default |
---|---|---|
`<folder_path>` | Directory containing CSV files to summarize and display | required |
`-h`, `--help` | Show the help message | N/A |
All parameters are documented via `-h`/`--help`.
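For instance:
$ python3 4_run_all_tickets.py --help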
Symptom | Fix |
---|---|
`java: command not found` | Check your JDK installation and `JVM_DIR`. |
Maven can't resolve dependencies | Make sure the target project builds without EnterpriseBench first. |
`FileNotFound: pr_states.csv` | Confirm you renamed your dataset correctly or pass `--dataset` to the script. |
Distributed under the Apache 2.0 license; see `LICENSE` for details.