Commit

Update README
john-b-yang committed Feb 4, 2025
1 parent 53448dd commit bc400fd
Showing 3 changed files with 69 additions and 49 deletions.
74 changes: 30 additions & 44 deletions README.md
@@ -30,7 +30,8 @@ Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.
Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), we've added a new evaluation mode that runs evaluations entirely on the cloud! See 🚀 Set Up on this page for more details.
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal) for more details.
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
@@ -69,37 +70,15 @@ python -m swebench.harness.run_evaluation \
--run_id validate-gold
```

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
```bash
python -m swebench.harness.run_evaluation \
--predictions_path gold \
--run_id validate-gold-modal \
--instance_ids sympy__sympy-20590 \
--modal true
```
This will execute the evaluation harness on Modal's cloud infrastructure, eliminating the need for local Docker setup and resource management.

> [!NOTE]
> Modal for SWE-bench Multimodal is currently experimental and may not be fully supported yet.
## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with docker desktop, make sure to increase your virtual disk space to have ~120 free GB available, and set max_workers to be consistent with the above for the CPUs available to docker.
>
> Support for `arm64` machines is experimental.
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
Evaluate patch predictions on SWE-bench Lite with the following command:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
--run_id <run_id> \
--namespace swebench
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
@@ -108,31 +87,38 @@ This command will generate docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`).

The final evaluation results will be stored in the `evaluation_results` directory.
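
Once a run finishes, the harness also writes a per-run report JSON summarizing outcomes. Below is a minimal sketch for printing a summary from that report; the report filename and the key names are assumptions based on the standard harness output, so check the files your run actually produces.

```python
import json
import sys

# Minimal sketch: summarize a harness report JSON.
# Usage (the filename is an example; use the report your run produced):
#   python summarize_report.py my-model.validate-gold.json
report_path = sys.argv[1]
with open(report_path) as f:
    report = json.load(f)

# Key names are assumed from the harness report format; adjust if your version differs.
print(f"Resolved {report.get('resolved_instances')} of {report.get('total_instances')} instances")
```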

> [!WARNING]
> SWE-bench evaluation can be resource-intensive.
> We recommend running on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> We recommend using fewer than `min(0.75 * os.cpu_count(), 24)` for `--max_workers`.
>
> If running with Docker desktop, make sure to increase your virtual disk space to ~120 free GB. Set max_workers to be consistent with the above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.
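
For convenience, here is a tiny sketch that computes the worker-count rule of thumb from the warning above on your machine:

```python
import os

# Rule of thumb from the warning above: keep --max_workers below min(0.75 * CPU count, 24).
suggested_cap = min(int(0.75 * os.cpu_count()), 24)
print(f"Keep --max_workers below roughly {suggested_cap}")
```
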
To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-Bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on-disk like LLaMA, or models you have access to through an API like GPT-4). The inference step is where you get a repo and an issue and have the model try to generate a fix for it.
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories, to make new SWE-Bench tasks.
See the [evaluation tutorial](./assets/evaluation.md) for a full rundown of the datasets you can evaluate.
If you're looking for non-local, cloud-based evaluations, check out...
* [sb-cli](https://github.com/swe-bench/sb-cli), our tool for running evaluations automatically on AWS, or...
* Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal).

Additionally, you can also:
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets.
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repository and an issue and have it generate a fix (see the sketch after this list).
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/collection.md)) on your own repositories, to make new SWE-bench tasks.
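
As a sketch of what the inference step consumes, here is a minimal example that loads SWE-bench Lite from the Hugging Face Hub and prints the parts of one task instance a model would see; the field names follow the published dataset schema, but double-check them against the dataset card.

```python
from datasets import load_dataset

# Each task instance pairs a repository snapshot with a real GitHub issue;
# the model's job is to generate a patch that resolves the issue.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
instance = dataset[0]

print(instance["instance_id"])               # task identifier, e.g. "<owner>__<repo>-<number>"
print(instance["repo"])                      # source repository the task comes from
print(instance["problem_statement"][:500])   # the issue text handed to the model
```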

## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))
| Datasets | Models | RAG |
| - | - | - |
| [💿 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) | [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) |
| [💿 SWE-bench Lite](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) | [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) |
| [💿 SWE-bench Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) | [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) |
| [💿 SWE-bench Multimodal](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) | [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) |
| | | [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) |

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
37 changes: 32 additions & 5 deletions assets/evaluation.md
@@ -1,5 +1,5 @@
# Evaluating with SWE-bench
John Yang &bull; November 6, 2023
# SWE-bench Evaluation
John Yang &bull; November 6, 2023 (Updated Feb. 4, 2025)

In this tutorial, we will explain how to evaluate models and methods using SWE-bench.

@@ -20,13 +20,40 @@ Store multiple predictions in a `.json` file formatted as `[<prediction 1>, <prediction 2>, ...]`
If you'd like examples, the [swe-bench/experiments](https://github.com/swe-bench/experiments) GitHub repository contains many examples of well-formed patches.
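
To make the expected format concrete, here is a minimal sketch that writes a predictions file with the fields the harness reads; the patch string is a placeholder and `my-model` is a hypothetical model name.

```python
import json

# One entry per task instance; a real entry carries a unified diff produced by your model.
predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",                 # hypothetical model name
        "model_patch": "diff --git a/a.py b/a.py\n...",   # placeholder patch text
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

You would then point `--predictions_path` at `predictions.json` when running the harness.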

## 🔄 Running Evaluation
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
Evaluate model predictions on the test split of SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
--run_id <run_id> \
--namespace swebench
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
# use --run_id to name the run, logs will be written to ./logs/run_evaluation/<run_id>
# use --split to specify which split to evaluate on, usually `dev` or `test`
```

You can run evaluation for the following (`--dataset_name`, `--split`) combinations (see the sanity-check sketch after this list):
* `princeton-nlp/SWE-bench_Lite`, `test` (300 task instances)
* `princeton-nlp/SWE-bench_Verified`, `test` (500)
* `princeton-nlp/SWE-bench`, `dev` (225)
* `princeton-nlp/SWE-bench`, `test` (2294)
* `princeton-nlp/SWE-bench_Multimodal`, `dev` (102)
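
As referenced above, here is a quick sanity check of these dataset/split combinations with the `datasets` library; it downloads only the dataset rows, not the evaluation Docker images.

```python
from datasets import load_dataset

# Dataset/split combinations listed above, with their expected sizes.
combinations = [
    ("princeton-nlp/SWE-bench_Lite", "test"),       # 300
    ("princeton-nlp/SWE-bench_Verified", "test"),   # 500
    ("princeton-nlp/SWE-bench", "dev"),             # 225
    ("princeton-nlp/SWE-bench", "test"),            # 2294
    ("princeton-nlp/SWE-bench_Multimodal", "dev"),  # 102
]

for dataset_name, split in combinations:
    ds = load_dataset(dataset_name, split=split)
    print(f"{dataset_name} [{split}]: {len(ds)} instances")
```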

You *cannot* run evaluation on the `test` split (517 instances) of `princeton-nlp/SWE-bench_Multimodal` using this repository.
To discourage intentional climbing of the leaderboard, we have deliberately kept the specifications for evaluating the test split private.
You can submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/), our cloud-based evaluation tool.

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
```bash
python -m swebench.harness.run_evaluation \
--predictions_path gold \
--run_id validate-gold-modal \
--instance_ids sympy__sympy-20590 \
--modal true
```
This will execute the evaluation harness on Modal's cloud infrastructure, eliminating the need for local Docker setup and resource management.

> [!NOTE]
> Modal for SWE-bench Multimodal is currently experimental and may not be fully supported yet.
7 changes: 7 additions & 0 deletions swebench/harness/run_evaluation.py
@@ -411,6 +411,13 @@ def main(
"""
Run evaluation harness for the given dataset and predictions.
"""
if dataset_name == "princeton-nlp/SWE-bench_Multimodal" and split == "test":
print(
"⚠️ Local evaluation for the test split of SWE-bench Multimodal is not supported. "
"Please check out sb-cli (https://github.com/swe-bench/sb-cli/) for instructions on how to submit predictions."
)
return

# set open file limit
assert len(run_id) > 0, "Run ID must be provided"
if report_dir is not None:
