Commit

Update README
john-b-yang committed Feb 4, 2025
1 parent 53448dd commit bc400fd
Showing 3 changed files with 69 additions and 49 deletions.
74 changes: 30 additions & 44 deletions README.md
@@ -30,7 +30,8 @@ Code and data for our ICLR 2024 paper <a href="http://swe-bench.github.io/paper.
Please refer our [website](http://swe-bench.github.io) for the public leaderboard and the [change log](https://github.com/princeton-nlp/SWE-bench/blob/main/CHANGELOG.md) for information on the latest updates to the SWE-bench benchmark.

## 📰 News
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), we've added a new evaluation mode that runs evaluations entirely on the cloud! See 🚀 Set Up on this page for more details.
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal) for more details.
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/princeton-nlp/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/princeton-nlp/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
@@ -69,37 +70,15 @@ python -m swebench.harness.run_evaluation \
--run_id validate-gold
```

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
```bash
python -m swebench.harness.run_evaluation \
--predictions_path gold \
--run_id validate-gold-modal \
--instance_ids sympy__sympy-20590 \
--modal true
```
This will execute the evaluation harness on Modal's cloud infrastructure, eliminating the need for local Docker setup and resource management.

> [!NOTE]
> Modal for SWE-bench Multimodal is currently experimental and may not be fully supported yet.
## 💽 Usage
> [!WARNING]
> Running fast evaluations on SWE-bench can be resource intensive
> We recommend running the evaluation harness on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> You may need to experiment with the `--max_workers` argument to find the optimal number of workers for your machine, but we recommend using fewer than `min(0.75 * os.cpu_count(), 24)`.
>
> If running with docker desktop, make sure to increase your virtual disk space to have ~120 free GB available, and set max_workers to be consistent with the above for the CPUs available to docker.
>
> Support for `arm64` machines is experimental.
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
Evaluate patch predictions on SWE-bench Lite with the following command:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
--run_id <run_id> \
--namespace swebench
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
```
@@ -108,31 +87,38 @@ This command will generate docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`).

The final evaluation results will be stored in the `evaluation_results` directory.
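
Once a run finishes, the harness also writes a per-run report JSON summarizing outcomes. Below is a minimal sketch for printing a summary from that report; the report filename and the key names are assumptions based on the standard harness output, so check the files your run actually produces.

```python
import json
import sys

# Minimal sketch: summarize a harness report JSON.
# Usage (the filename is an example; use the report your run produced):
#   python summarize_report.py my-model.validate-gold.json
report_path = sys.argv[1]
with open(report_path) as f:
    report = json.load(f)

# Key names are assumed from the harness report format; adjust if your version differs.
print(f"Resolved {report.get('resolved_instances')} of {report.get('total_instances')} instances")
```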

> [!WARNING]
> SWE-bench evaluation can be resource-intensive.
> We recommend running on an `x86_64` machine with at least 120GB of free storage, 16GB of RAM, and 8 CPU cores.
> We recommend using fewer than `min(0.75 * os.cpu_count(), 24)` for `--max_workers`.
>
> If running with Docker desktop, make sure to increase your virtual disk space to ~120 free GB. Set max_workers to be consistent with the above for the CPUs available to Docker.
>
> Support for `arm64` machines is experimental.
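
For convenience, here is a tiny sketch that computes the worker-count rule of thumb from the warning above on your machine:

```python
import os

# Rule of thumb from the warning above: keep --max_workers below min(0.75 * CPU count, 24).
suggested_cap = min(int(0.75 * os.cpu_count()), 24)
print(f"Keep --max_workers below roughly {suggested_cap}")
```
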
To see the full list of arguments for the evaluation harness, run:
```bash
python -m swebench.harness.run_evaluation --help
```

Additionally, the SWE-Bench repo can help you:
* Train your own models on our pre-processed datasets
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (either models you have on-disk like LLaMA, or models you have access to through an API like GPT-4). The inference step is where you get a repo and an issue and have the model try to generate a fix for it.
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) on your own repositories, to make new SWE-Bench tasks.
See the [evaluation tutorial](./assets/evaluation.md) for a full rundown of the datasets you can evaluate.
If you're looking for non-local, cloud-based evaluations, check out...
* [sb-cli](https://github.com/swe-bench/sb-cli), our tool for running evaluations automatically on AWS, or...
* Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md#-evaluation-with-modal).

Additionally, you can also:
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets.
* Run [inference](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repository and an issue and have it generate a fix (see the sketch after this list).
* Run SWE-bench's [data collection procedure](https://github.com/princeton-nlp/SWE-bench/blob/main/swebench/collect/) ([tutorial](./assets/collection.md)) on your own repositories, to make new SWE-bench tasks.
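
As a sketch of what the inference step consumes, here is a minimal example that loads SWE-bench Lite from the Hugging Face Hub and prints the parts of one task instance a model would see; the field names follow the published dataset schema, but double-check them against the dataset card.

```python
from datasets import load_dataset

# Each task instance pairs a repository snapshot with a real GitHub issue;
# the model's job is to generate a patch that resolves the issue.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
instance = dataset[0]

print(instance["instance_id"])               # task identifier, e.g. "<owner>__<repo>-<number>"
print(instance["repo"])                      # source repository the task comes from
print(instance["problem_statement"][:500])   # the issue text handed to the model
```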

## ⬇️ Downloads
| Datasets | Models |
| - | - |
| [🤗 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) |
| [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) |
| [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) |
| [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) |
| [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) | |
| [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) | |

## 🍎 Tutorials
We've also written the following blog posts on how to use different parts of SWE-bench.
If you'd like to see a post about a particular topic, please let us know via an issue.
* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/collection.md))
* [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/blob/main/assets/evaluation.md))
| Datasets | Models | RAG |
| - | - | - |
| [💿 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) | [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) |
| [💿 SWE-bench Lite](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) | [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) |
| [💿 SWE-bench Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) | [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) |
| [💿 SWE-bench Multimodal](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) | [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) |
| | | [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) |

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
37 changes: 32 additions & 5 deletions assets/evaluation.md
@@ -1,5 +1,5 @@
# Evaluating with SWE-bench
John Yang &bull; November 6, 2023
# SWE-bench Evaluation
John Yang &bull; November 6, 2023 (Updated Feb. 4, 2025)

In this tutorial, we will explain how to evaluate models and methods using SWE-bench.

@@ -20,13 +20,40 @@ Store multiple predictions in a `.json` file formatted as `[<prediction 1>, <prediction 2>, ...]`
If you'd like examples, the [swe-bench/experiments](https://github.com/swe-bench/experiments) GitHub repository contains many examples of well-formed patches.
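
To make the expected format concrete, here is a minimal sketch that writes a predictions file with the fields the harness reads; the patch string is a placeholder and `my-model` is a hypothetical model name.

```python
import json

# One entry per task instance; a real entry carries a unified diff produced by your model.
predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",                 # hypothetical model name
        "model_patch": "diff --git a/a.py b/a.py\n...",   # placeholder patch text
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```

You would then point `--predictions_path` at `predictions.json` when running the harness.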

## 🔄 Running Evaluation
Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
Evaluate model predictions on the test split of SWE-bench Lite using the evaluation harness with the following command:
```bash
python -m swebench.harness.run_evaluation \
--dataset_name princeton-nlp/SWE-bench_Lite \
--predictions_path <path_to_predictions> \
--max_workers <num_workers> \
--run_id <run_id>
--run_id <run_id> \
--namespace swebench
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
# use --run_id to name the run, logs will be written to ./logs/run_evaluation/<run_id>
# use --split to specify which split to evaluate on, usually `dev` or `test`
```

You can run evaluation for the following (`--dataset_name`, `--split`) combinations (see the sanity-check sketch after this list):
* `princeton-nlp/SWE-bench_Lite`, `test` (300 task instances)
* `princeton-nlp/SWE-bench_Verified`, `test` (500)
* `princeton-nlp/SWE-bench`, `dev` (225)
* `princeton-nlp/SWE-bench`, `test` (2294)
* `princeton-nlp/SWE-bench_Multimodal`, `dev` (102)
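
As referenced above, here is a quick sanity check of these dataset/split combinations with the `datasets` library; it downloads only the dataset rows, not the evaluation Docker images.

```python
from datasets import load_dataset

# Dataset/split combinations listed above, with their expected sizes.
combinations = [
    ("princeton-nlp/SWE-bench_Lite", "test"),       # 300
    ("princeton-nlp/SWE-bench_Verified", "test"),   # 500
    ("princeton-nlp/SWE-bench", "dev"),             # 225
    ("princeton-nlp/SWE-bench", "test"),            # 2294
    ("princeton-nlp/SWE-bench_Multimodal", "dev"),  # 102
]

for dataset_name, split in combinations:
    ds = load_dataset(dataset_name, split=split)
    print(f"{dataset_name} [{split}]: {len(ds)} instances")
```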

You *cannot* run evaluation on the `test` split (517 instances) of `princeton-nlp/SWE-bench_Multimodal` using this repository.
To discourage intentional climbing of the leaderboard, we have deliberately kept the specifications for evaluating the test split private.
You can submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/), our cloud-based evaluation tool.

### 🌩️ Evaluation with Modal
You can also run evaluations entirely on the cloud using [Modal](https://modal.com/) to avoid local setup and resource constraints:
```bash
python -m swebench.harness.run_evaluation \
--predictions_path gold \
--run_id validate-gold-modal \
--instance_ids sympy__sympy-20590 \
--modal true
```
This will execute the evaluation harness on Modal's cloud infrastructure, eliminating the need for local Docker setup and resource management.

> [!NOTE]
> Modal for SWE-bench Multimodal is currently experimental and may not be fully supported yet.
7 changes: 7 additions & 0 deletions swebench/harness/run_evaluation.py
@@ -411,6 +411,13 @@ def main(
"""
Run evaluation harness for the given dataset and predictions.
"""
if dataset_name == "princeton-nlp/SWE-bench_Multimodal" and split == "test":
print(
"⚠️ Local evaluation for the test split of SWE-bench Multimodal is not supported. "
"Please check out sb-cli (https://github.com/swe-bench/sb-cli/) for instructions on how to submit predictions."
)
return

# set open file limit
assert len(run_id) > 0, "Run ID must be provided"
if report_dir is not None:
