
Commit 5433333

Authored by DachengLi1, caoshiyi, cck0517, Xiuyu-Li, and Shangyint

Update Code for S* (#80)

* S* code
* update readme
* exclude check

Co-authored-by: Dacheng Li <[email protected]>
Co-authored-by: Shiyi Cao <[email protected]>
Co-authored-by: Chengkun Cao <[email protected]>
Co-authored-by: Xiuyu Li <[email protected]>
Co-authored-by: Shangyin Tan <[email protected]>
Co-authored-by: caoshiyi <[email protected]>

1 parent 4c2085a · commit 5433333

File tree

163 files changed: +19744 −2 lines

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions

```diff
@@ -5,12 +5,12 @@ repos:
       - id: ruff
         args: [ --fix, --exit-non-zero-on-fix ]
         # NOTE (sumanthrh): Many of the files excluded here are used for validating code generation, and linters do not recognize some of the logic in these files. skythought/train is excluded for now because it's a fork of Llamafactory
-        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|tasks/taco/taco_util\.py|tasks/apps/apps_util\.py|scripts/prompts\.py)$
+        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|tasks/taco/taco_util\.py|tasks/apps/apps_util\.py|scripts/prompts\.py|skythought/test-time-scaling/.*)$

   # Black needs to be ran after ruff with --fix
   - repo: https://github.com/psf/black
     rev: 24.10.0
     hooks:
       - id: black
-        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py)$
+        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|skythought/test-time-scaling/.*)$
```
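As a quick sanity check (not part of the commit), the updated `exclude` pattern can be exercised with Python's `re` module; pre-commit matches `exclude` regexes against file paths with `re.search`. The helper below is illustrative, not from the repo:

```python
import re

# The updated ruff exclude pattern from .pre-commit-config.yaml.
EXCLUDE = (
    r"(^skythought/train/.*"
    r"|^skythought/skythought-rl/.*"
    r"|tasks/taco/pyext2\.py"
    r"|tasks/taco/taco_util\.py"
    r"|tasks/apps/apps_util\.py"
    r"|scripts/prompts\.py"
    r"|skythought/test-time-scaling/.*)$"
)

def is_excluded(path: str) -> bool:
    # pre-commit applies exclude patterns with re.search against the path.
    return re.search(EXCLUDE, path) is not None

print(is_excluded("skythought/test-time-scaling/evaluate_multiprocess.py"))  # True
print(is_excluded("skythought/train/train.py"))                              # True
print(is_excluded("skythought/evals/some_task.py"))                          # False
```

Any file under the new `skythought/test-time-scaling/` directory is now skipped by both ruff and black.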

README.md

Lines changed: 1 addition & 0 deletions

```diff
@@ -20,6 +20,7 @@

 # News
+- **[2025/02/21]** 🎉 We released S*: Test time scaling for code generation ([paper](https://arxiv.org/pdf/2502.14382), [code](https://github.com/NovaSky-AI/SkyThought/tree/main/skythought/test-time-scaling)), a simple and extensible test time scaling framework for code generation.
 - **[2025/02/11]** 🎉 We released Sky-T1-7B ([model](https://huggingface.co/NovaSky-AI/Sky-T1-7B)) and Sky-T1-mini ([model](https://huggingface.co/NovaSky-AI/Sky-T1-mini)) to demonstrate the potential of RL in further enhancing model's capability beyond distillation.
 - **[2025/01/23]** ⚡️ We released Sky-T1-32B-Flash ([model](https://huggingface.co/NovaSky-AI/Sky-T1-32B-Flash), [data](https://huggingface.co/datasets/NovaSky-AI/Sky-T1_preference_data_10k)) to tackle overthinking and reduce reasoning sequence lengths while maintaining accuracy.
 - **[2025/01/19]** 🎉 [Chat demo](http://164.152.23.196:3000/) for Sky-T1-32B-Preview is alive! Please check it out!
```
New file: README for `skythought/test-time-scaling`

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
# S*: Test Time Scaling for Code Generation ####
2+
This folder provides the code for the paper "S*: Test Time Scaling for Code Generation".
3+
4+
![Overview of S* approach](assets/figure1.png)
5+
6+
## Installation (Main packages)
7+
```dspy=2.6.2, torch, vllm```
8+
9+
## Usage
10+
The scripts to reproduce the results in the paper are in the `scripts` folder.
11+
- baselines are in `baselines`, `baselines_selfdebug`, `majority_baselines`.
12+
- experiments on dev set are in: `sec[4,5,6]`.
13+
- experiments on final test set are in: `final_[]`. First run commands under `final_oracle` to produce all generations without different selection methods, then run commands under `final_[]_cached` to produce generations with different selection methods.
14+
15+
Results are availeble in google cloud storage ([Link](https://drive.google.com/drive/u/1/folders/1kmCoJ7Mkvj-umpkfsA5960hYpNrgH4X4)).
16+
17+
Simple run commands to produce generations with oracle selection and 3 rounds of generation for gpt-4o-mini.
18+
19+
Set OPENAI_API_KEY in your environment variable with `export OPENAI_API_KEY=xxx`.
20+
21+
```
22+
python evaluate_multiprocess.py \
23+
--difficulty=easy \
24+
--temperature=0.7 \
25+
--num_threads=32 \
26+
--n=16 \
27+
--selection oracle_all_rounds \
28+
--lcb_version release_v2 \
29+
--num_round 3 \
30+
--result_json_path="results/final_4omini_n_16_debug_public3_select_oracle_easy.json"
31+
```
32+
33+
To run experiments with local serve models, use ```vllm serve model_name``` to serve the model first.
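For example, serving might look like the following (the model name and port here are placeholders, not from this repo):

```shell
# Launch an OpenAI-compatible vLLM server; model name and port are illustrative.
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
```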
#### Citation

```
@article{li2025sstar,
  title={S*: Test Time Scaling for Code Generation},
  author={Li, Dacheng and Cao, Shiyi and Cao, Chengkun and Li, Xiuyu and Tan, Shangyin and Keutzer, Kurt and Xing, Jiarong and Gonzalez, Joseph E. and Stoica, Ion},
  year={2025}
}
```
assets/figure1.png — 89.6 KB (binary image added)
