Commit acb2e77

address feedback

Signed-off-by: SumanthRH <[email protected]>

1 parent 9a7b4b3 · commit acb2e77

File tree

4 files changed: +41 -11 lines changed


pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ dependencies = [
     "pydantic",
     "setuptools",
     "typer",
+    "hf_transfer",
 ]
 license = { text = "Apache-2.0" }
 readme = "README.md"
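The new `hf_transfer` dependency accelerates downloads from the Hugging Face Hub. A minimal sketch of how it is typically enabled (an assumption about intended usage, since the diff itself only adds the dependency): `huggingface_hub` routes downloads through `hf_transfer` only when the `HF_HUB_ENABLE_HF_TRANSFER` environment variable is set before the first download call.

```python
import os

# hf_transfer is a Rust-based download accelerator. huggingface_hub only
# uses it when this variable is set in the environment before any download
# call is made (e.g. snapshot_download or a vLLM model load).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
```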

recipes/sky-t1-preview/README.md

Lines changed: 7 additions & 4 deletions

@@ -34,11 +34,11 @@ skythought generate --task apps --model Qwen/QwQ-32B-Preview --backend vllm --ba
 
 skythought generate --task taco --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "MEDIUM"}}' --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "math"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_math --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "amc_aime"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_amc_aime --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "olympiads"}}' --result-dir $SKYT_HOME/data --start 0 --end 20000
+skythought generate --task numina_olympiads --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data --start 0 --end 20000
 ```
 
 This will save the results in individual folders in `result-dir`. The directory structure should be as follows:

@@ -63,10 +63,13 @@ python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
 
 ### Step 3: Reject Sampling on the formatted data (Example Usage with previous script)
 
+For each folder in `result-dir` saved previously (e.g. `Qwen_QwQ-32B-Preview_numina_myHash`), obtain the scores with the following command:
+
 ```shell
 skythought score --task apps --path <path_to_run_folder>
 ```
-Similar for other datasets.
+
+This will overwrite the `results.json` files and add a `"correctness"` entry to each model response.
 
 ### Convert to ShareGPT format for training
 After obtaining multiple converted files, merge them together and convert them to the ShareGPT format for training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download that portion of the data and simply concatenate it with the data obtained above.
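The `"correctness"` entries written by `skythought score` are what make rejection sampling possible: keep only responses marked correct before converting to the training format. A minimal sketch, assuming a hypothetical `results.json` schema in which each sample id maps to a list of scored responses (the real schema may differ):

```python
import json


def filter_correct(results_path: str) -> dict:
    """Keep only responses with a truthy "correctness" flag.

    Assumes a hypothetical schema {sample_id: {"responses": [{...,
    "correctness": bool}, ...]}}; adapt to the actual results.json layout.
    """
    with open(results_path) as f:
        results = json.load(f)
    kept = {}
    for sample_id, entry in results.items():
        correct = [r for r in entry.get("responses", []) if r.get("correctness")]
        if correct:
            # Drop samples with no correct response; keep only correct ones.
            kept[sample_id] = {**entry, "responses": correct}
    return kept
```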

skythought/evals/README.md

Lines changed: 20 additions & 3 deletions

@@ -10,14 +10,25 @@ For running OpenAI model, export the OpenAI key.
 export OPENAI_API_KEY={openai_api_key}
 ```
 
+## Usage
+
+We provide three commands in the CLI:
+
+- `skythought evaluate`: Evaluate a model on a given task.
+- `skythought generate`: Generate model outputs for a pre-configured task.
+- `skythought score`: Score saved generations for a given task.
+
+For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
+
 ## Generation and Evaluation
 
 ### Benchmark Evaluation
 
-Given below are two examples for evaluation. For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
+Given below are two examples for evaluation.
 
 ```shell
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8 --result-dir ./
+
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task gpqa_diamond --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8
 ```
 

@@ -85,7 +96,13 @@ Currently we support distill and reject sampling for NUMINA, APPS, and TACO data
 #### Example Usage
 
 ```shell
-skythought generate --model Qwen/QwQ-32B-Preview --task apps --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+skythought generate --model Qwen/QwQ-32B-Preview --task numina_amc_aime --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+```
+
+Once the generations are saved, you can apply any postprocessing to the results (saved in a `results.json` file in a separate run folder) and then run:
+
+```shell
+skythought score --task numina_amc_aime --run-dir <path>
 ```
 
 ### Reproducibility Issues

@@ -97,4 +114,4 @@ We've noticed that it can be hard to reproduce results in reasoning benchmarks.
 - vLLM settings: With vLLM, we've also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.
 - vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.
 
-We recommend to run all evaluation benchmarks at full precision, i.e float32 to avoid this. By default, we run evaluation in `float32`, which can be customized with the `--backend-args` flag for local inference. In full-precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
+We recommend running evaluation benchmarks at full precision, i.e. `float32`, to avoid this. At full precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
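The half-precision reproducibility issues described above come down to floating-point rounding: fp16 addition is not associative, and batched or tensor-parallel kernels reduce sums in different orders. A minimal illustration of the effect (not the project's code), using NumPy:

```python
import numpy as np

# The same three numbers summed in two orders give different fp16 results.
# Batched and tensor-parallel kernels change reduction order, which is why
# half-precision eval numbers can drift with batch size while fp32 is robust.
a, b, c = np.float16(1000.0), np.float16(-1000.0), np.float16(0.0977)
left = np.float16(np.float16(a + b) + c)   # (a + b) + c: exact cancellation first
right = np.float16(a + np.float16(b + c))  # a + (b + c): b + c rounds to -1000.0
print(left, right)  # the two orderings disagree
```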

skythought/evals/cli.py

Lines changed: 13 additions & 4 deletions

@@ -29,6 +29,8 @@
 
 app = typer.Typer(pretty_exceptions_enable=False)
 
+BACKEND_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"
+
 
 def get_run_config(
     task: str,

@@ -110,7 +112,12 @@ def parse_common_args(
     )
     task_args_as_dict = parse_multi_args(task_args)
     sampling_params_as_dict = parse_multi_args(sampling_params)
-    backend_args_as_dict = parse_multi_args(backend_args)
+    user_provided_backend_args_as_dict = parse_multi_args(backend_args)
+
+    backend_args_default = parse_multi_args(BACKEND_DEFAULT)
+    backend_args_as_dict = dict(
+        **backend_args_default, **user_provided_backend_args_as_dict
+    )
 
     if n is not None:
         sampling_params_as_dict["n"] = n

@@ -189,7 +196,7 @@ def evaluate(
     backend_args: Annotated[
         str,
         typer.Option(
-            help="Backend parameters to use for inference. For open-source models, we perform inference in float32 by default",
+            help="Backend parameters to use for inference.",
             case_sensitive=False,
         ),
     ] = "",

@@ -199,7 +206,7 @@ def evaluate(
             help="Sampling parameters to use for inference.",
             case_sensitive=False,
         ),
-    ] = "temperature=0,top_p=1,max_tokens=32768",
+    ] = BACKEND_DEFAULT,
     result_dir: Annotated[
         str,
         typer.Option(

@@ -225,7 +232,9 @@ def evaluate(
     seed: Annotated[int, typer.Option(help="Random seed.")] = 41,
     assistant_prefill: Annotated[
         str,
-        typer.Option(help=r'Assistant prefill for the model response. Ex: "<think>\n"'),
+        typer.Option(
+            help=r'Assistant prefill for the model response, overriding any pre-configured assistant prefill for this model. Ex: "<think>\n"'
+        ),
     ] = None,
     as_test: Annotated[
         bool, typer.Option(help="Perform a test run on 10 samples of the dataset.")
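The `BACKEND_DEFAULT` change above merges default and user-provided backend args by keyword unpacking. One Python subtlety worth knowing when writing merges like this: `dict(**a, **b)` raises `TypeError` when the two mappings share a key, whereas the literal `{**a, **b}` lets the right-hand mapping win. A minimal sketch with a simplified stand-in for `parse_multi_args` (the real parser may handle more cases, such as type coercion):

```python
def parse_multi_args(args: str) -> dict:
    # Simplified stand-in: parse "k1=v1,k2=v2" into a dict of strings.
    if not args:
        return {}
    return dict(pair.split("=", 1) for pair in args.split(","))


BACKEND_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"

defaults = parse_multi_args(BACKEND_DEFAULT)
user = parse_multi_args("max_tokens=16384")

# With {**a, **b}, the later mapping wins: user values override the defaults.
merged = {**defaults, **user}
# By contrast, dict(**defaults, **user) would raise TypeError here, because
# "max_tokens" appears in both mappings.
```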
