Minor doc updates

swe-bench · Jul 29, 2024 · commit c2b3cef
1 parent 9802a2c

Showing 8 changed files with 38 additions and 100 deletions.
diff --git a/README.md b/README.md
@@ -89,7 +89,7 @@ python -m swebench.harness.run_evaluation \
     # use --run_id to name the evaluation run
 ```
 
-This command will generate docker build logs (`build_image_logs`) and evaluation logs (`run_instance_logs`) in the current directory.
+This command will generate docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.
 
 The final evaluation results will be stored in the `evaluation_results` directory.
 
@@ -116,7 +116,7 @@ Additionally, the SWE-Bench repo can help you:
 ## 🍎 Tutorials
 We've also written the following blog posts on how to use different parts of SWE-bench.
 If you'd like to see a post about a particular topic, please let us know via an issue.
-* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md))
+* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
 * [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))
 
 ## 💫 Contributions

diff --git a/swebench/collect/collection.md b/assets/collection.md
diff --git a/assets/evaluation.md b/assets/evaluation.md
@@ -0,0 +1,32 @@
+# Evaluating with SWE-bench
+John Yang &bull; November 6, 2023
+
+In this tutorial, we will explain how to evaluate models and methods using SWE-bench.
+
+## 🤖 Creating Predictions
+For each task instance of the SWE-bench dataset, given an issue (`problem_statement`) + codebase (`repo` + `base_commit`), your model should attempt to write a diff patch prediction. For full details on the SWE-bench task, please refer to Section 2 of the main paper.
+
+Each prediction must be formatted as follows:
+```json
+{
+    "instance_id": "<Unique task instance ID>",
+    "model_patch": "<.patch file content string>",
+    "model_name_or_path": "<Model name here (e.g. SWE-Llama-13b)>"
+}
+```
+
+Store multiple predictions in a `.json` file formatted as `[<prediction 1>, <prediction 2>,... <prediction n>]`. It is not necessary to generate predictions for every task instance.
+
+If you'd like examples, the [swe-bench/experiments](https://github.com/swe-bench/experiments) GitHub repository contains many examples of well-formed patches.
+
+## 🔄 Running Evaluation
+Evaluate model predictions on SWE-bench Lite using the evaluation harness with the following command:
+```bash
+python -m swebench.harness.run_evaluation \
+    --dataset_name princeton-nlp/SWE-bench_Lite \
+    --predictions_path <path_to_predictions> \
+    --max_workers <num_workers> \
+    --run_id <run_id>
+    # use --predictions_path 'gold' to verify the gold patches
+    # use --run_id to name the evaluation run
+```
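
Note on the prediction format described in `assets/evaluation.md` above: a predictions file in that shape could be assembled roughly as follows. This is a minimal sketch, not part of the `swebench` package. The helper names (`make_prediction`, `write_predictions`), the instance ID, the patch text, and the model name are all hypothetical placeholders; only the three keys and the list-of-predictions layout come from the tutorial text.

```python
import json
import tempfile
from pathlib import Path

# The three keys named in the tutorial's prediction format.
REQUIRED_KEYS = ("instance_id", "model_patch", "model_name_or_path")


def make_prediction(instance_id: str, model_patch: str, model_name_or_path: str) -> dict:
    """Build one prediction dict (hypothetical helper, not a swebench API)."""
    return {
        "instance_id": instance_id,
        "model_patch": model_patch,
        "model_name_or_path": model_name_or_path,
    }


def write_predictions(predictions: list, path) -> None:
    """Check each prediction for the required keys, then write a JSON list."""
    for prediction in predictions:
        missing = [key for key in REQUIRED_KEYS if key not in prediction]
        if missing:
            raise ValueError(f"prediction is missing keys: {missing}")
    Path(path).write_text(json.dumps(predictions, indent=2), encoding="utf-8")


# Placeholder values for illustration only (assumptions, not real task data).
example = make_prediction(
    instance_id="example__repo-123",
    model_patch="diff --git a/foo.py b/foo.py\n",
    model_name_or_path="my-model",
)

# Write to a temporary directory so the sketch leaves the working tree untouched.
out_path = Path(tempfile.mkdtemp()) / "predictions.json"
write_predictions([example], out_path)
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```

The resulting file path is what would be passed as `--predictions_path` in the command shown in the tutorial. Using `json.dumps` also avoids hand-written JSON mistakes such as trailing commas.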
diff --git a/docs/README_CN.md b/docs/README_CN.md
@@ -63,7 +63,7 @@ SWE-bench 是一个用于评估大型语言模型的基准，这些模型是从
 ## 🍎 教程
 我们还写了关于如何使用SWE-bench不同部分的博客文章。
 如果您想看到关于特定主题的文章，请通过问题告诉我们。
-* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
 * [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))
 
 ## 💫 贡献

diff --git a/docs/README_JP.md b/docs/README_JP.md
@@ -65,7 +65,7 @@ SWE-Bench を使用するには、以下のことができます:
 ## 🍎 チュートリアル 
 SWE-benchの様々な部分の使い方についても、以下のブログ記事を書いています。
 特定のトピックについての投稿を見たい場合は、issueでお知らせください。
-* [2023年11月1日] SWE-Benchの評価タスクの収集について ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+* [2023年11月1日] SWE-Benchの評価タスクの収集について ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
 * [2023年11月6日] SWE-benchでの評価について ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))
 
 ## 💫 貢献

diff --git a/docs/README_TW.md b/docs/README_TW.md
@@ -63,7 +63,7 @@ SWE-bench 是一個用於評估大型語言模型的基準，這些模型是從
 ## 🍎 教程
 我們還撰寫了以下有關如何使用SWE-bench不同部分的博客文章。
 如果您想看到有關特定主題的文章，請通過問題告訴我們。
-* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md.md))
+* [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md))
 * [Nov 6. 2023] Evaluating on SWE-bench ([🔗](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/harness/evaluation.md))
 
 ## 💫 貢獻

diff --git a/swebench/collect/README.md b/swebench/collect/README.md
@@ -1,7 +1,7 @@
 # Data Collection
 This folder includes the code for the first two parts of the benchmark construction procedure as described in the paper, specifically 1. Repo selection and data scraping, and 2. Attribute-based filtering.
 
-We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/collect/collection.md) that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.
+We include a comprehensive [tutorial](https://github.com/princeton-nlp/SWE-bench/tree/main/swebench/assets/collection.md) that describes the end-to-end procedure for collecting evaluation task instances from PyPI repositories.
 
 > SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
 

diff --git a/swebench/harness/evaluation.md b/swebench/harness/evaluation.md