Commit acb2e77

address feedback

Signed-off-by: SumanthRH <[email protected]>

1 parent 9a7b4b3 · commit acb2e77

File tree

4 files changed: +41 -11 lines changed


pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ dependencies = [
     "pydantic",
     "setuptools",
     "typer",
+    "hf_transfer",
 ]
 license = { text = "Apache-2.0" }
 readme = "README.md"
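The new `hf_transfer` dependency accelerates downloads from the Hugging Face Hub. A minimal sketch of how it is typically enabled (an assumption about intended usage, since the diff itself only adds the dependency): `huggingface_hub` routes downloads through `hf_transfer` only when the `HF_HUB_ENABLE_HF_TRANSFER` environment variable is set before the first download call.

```python
import os

# hf_transfer is a Rust-based download accelerator. huggingface_hub only
# uses it when this variable is set in the environment before any download
# call is made (e.g. snapshot_download or a vLLM model load).
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
```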

recipes/sky-t1-preview/README.md

Lines changed: 7 additions & 4 deletions

@@ -34,11 +34,11 @@ skythought generate --task apps --model Qwen/QwQ-32B-Preview --backend vllm --ba
 
 skythought generate --task taco --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "MEDIUM"}}' --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "math"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_math --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "amc_aime"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_amc_aime --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "olympiads"}}' --result-dir $SKYT_HOME/data --start 0 --end 20000
+skythought generate --task numina_olympiads --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data --start 0 --end 20000
 ```
 
 This will save the results in individual folders in `result-dir`. The directory structure should be as follows:

@@ -63,10 +63,13 @@ python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
 
 ### Step 3: Reject Sampling on the formatted data (Example Usage with previous script)
 
+For each folder in `result-dir` saved previously (e.g. `Qwen_QwQ-32B-Preview_numina_myHash`), obtain the scores with the following command:
+
 ```shell
 skythought score --task apps --path <path_to_run_folder>
 ```
-Similar for other datasets.
+
+This will overwrite the `results.json` files and add a `"correctness"` entry to each model response.
 
 ### Convert to ShareGPT format for training
 After obtaining multiple converted files, merge them together and convert them to the ShareGPT format for training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download that portion of the data and simply concatenate it with the data obtained above.
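The `"correctness"` entries written by `skythought score` are what make rejection sampling possible: keep only responses marked correct before converting to the training format. A minimal sketch, assuming a hypothetical `results.json` schema in which each sample id maps to a list of scored responses (the real schema may differ):

```python
import json


def filter_correct(results_path: str) -> dict:
    """Keep only responses with a truthy "correctness" flag.

    Assumes a hypothetical schema {sample_id: {"responses": [{...,
    "correctness": bool}, ...]}}; adapt to the actual results.json layout.
    """
    with open(results_path) as f:
        results = json.load(f)
    kept = {}
    for sample_id, entry in results.items():
        correct = [r for r in entry.get("responses", []) if r.get("correctness")]
        if correct:
            # Drop samples with no correct response; keep only correct ones.
            kept[sample_id] = {**entry, "responses": correct}
    return kept
```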

skythought/evals/README.md

Lines changed: 20 additions & 3 deletions

@@ -10,14 +10,25 @@ For running OpenAI model, export the OpenAI key.
 export OPENAI_API_KEY={openai_api_key}
 ```
 
+## Usage
+
+We provide three commands in the CLI:
+
+- `skythought evaluate`: Evaluate a model on a given task.
+- `skythought generate`: Generate model outputs for a pre-configured task.
+- `skythought score`: Score saved generations for a given task.
+
+For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
+
 ## Generation and Evaluation
 
 ### Benchmark Evaluation
 
-Given below are two examples for evaluation. For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
+Given below are two examples for evaluation.
 
 ```shell
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8 --result-dir ./
+
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task gpqa_diamond --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8
 ```
 

@@ -85,7 +96,13 @@ Currently we support distill and reject sampling for NUMINA, APPS, and TACO data
 #### Example Usage
 
 ```shell
-skythought generate --model Qwen/QwQ-32B-Preview --task apps --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+skythought generate --model Qwen/QwQ-32B-Preview --task numina_amc_aime --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+```
+
+Once the generations are saved, you can apply any postprocessing to the results (saved in a `results.json` file in a separate run folder) and then run:
+
+```shell
+skythought score --task numina_amc_aime --run-dir <path>
 ```
 
 ### Reproducibility Issues

@@ -97,4 +114,4 @@ We've noticed that it can be hard to reproduce results in reasoning benchmarks.
 - vLLM settings: With vLLM, we've also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.
 - vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.
 
-We recommend to run all evaluation benchmarks at full precision, i.e float32 to avoid this. By default, we run evaluation in `float32`, which can be customized with the `--backend-args` flag for local inference. In full-precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
+We recommend running evaluation benchmarks at full precision, i.e. `float32`, to avoid this. At full precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
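The half-precision reproducibility issues described above come down to floating-point rounding: fp16 addition is not associative, and batched or tensor-parallel kernels reduce sums in different orders. A minimal illustration of the effect (not the project's code), using NumPy:

```python
import numpy as np

# The same three numbers summed in two orders give different fp16 results.
# Batched and tensor-parallel kernels change reduction order, which is why
# half-precision eval numbers can drift with batch size while fp32 is robust.
a, b, c = np.float16(1000.0), np.float16(-1000.0), np.float16(0.0977)
left = np.float16(np.float16(a + b) + c)   # (a + b) + c: exact cancellation first
right = np.float16(a + np.float16(b + c))  # a + (b + c): b + c rounds to -1000.0
print(left, right)  # the two orderings disagree
```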

skythought/evals/cli.py

Lines changed: 13 additions & 4 deletions

@@ -29,6 +29,8 @@
 
 app = typer.Typer(pretty_exceptions_enable=False)
 
+BACKEND_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"
+
 
 def get_run_config(
     task: str,

@@ -110,7 +112,12 @@ def parse_common_args(
     )
     task_args_as_dict = parse_multi_args(task_args)
     sampling_params_as_dict = parse_multi_args(sampling_params)
-    backend_args_as_dict = parse_multi_args(backend_args)
+    user_provided_backend_args_as_dict = parse_multi_args(backend_args)
+
+    backend_args_default = parse_multi_args(BACKEND_DEFAULT)
+    backend_args_as_dict = dict(
+        **backend_args_default, **user_provided_backend_args_as_dict
+    )
 
     if n is not None:
         sampling_params_as_dict["n"] = n

@@ -189,7 +196,7 @@ def evaluate(
     backend_args: Annotated[
         str,
         typer.Option(
-            help="Backend parameters to use for inference. For open-source models, we perform inference in float32 by default",
+            help="Backend parameters to use for inference.",
             case_sensitive=False,
         ),
     ] = "",

@@ -199,7 +206,7 @@ def evaluate(
             help="Sampling parameters to use for inference.",
             case_sensitive=False,
         ),
-    ] = "temperature=0,top_p=1,max_tokens=32768",
+    ] = BACKEND_DEFAULT,
     result_dir: Annotated[
         str,
         typer.Option(

@@ -225,7 +232,9 @@ def evaluate(
     seed: Annotated[int, typer.Option(help="Random seed.")] = 41,
     assistant_prefill: Annotated[
         str,
-        typer.Option(help=r'Assistant prefill for the model response. Ex: "<think>\n"'),
+        typer.Option(
+            help=r'Assistant prefill for the model response, overriding any pre-configured assistant prefill for this model. Ex: "<think>\n"'
+        ),
     ] = None,
     as_test: Annotated[
         bool, typer.Option(help="Perform a test run on 10 samples of the dataset.")
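The `BACKEND_DEFAULT` change above merges default and user-provided backend args by keyword unpacking. One Python subtlety worth knowing when writing merges like this: `dict(**a, **b)` raises `TypeError` when the two mappings share a key, whereas the literal `{**a, **b}` lets the right-hand mapping win. A minimal sketch with a simplified stand-in for `parse_multi_args` (the real parser may handle more cases, such as type coercion):

```python
def parse_multi_args(args: str) -> dict:
    # Simplified stand-in: parse "k1=v1,k2=v2" into a dict of strings.
    if not args:
        return {}
    return dict(pair.split("=", 1) for pair in args.split(","))


BACKEND_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"

defaults = parse_multi_args(BACKEND_DEFAULT)
user = parse_multi_args("max_tokens=16384")

# With {**a, **b}, the later mapping wins: user values override the defaults.
merged = {**defaults, **user}
# By contrast, dict(**defaults, **user) would raise TypeError here, because
# "max_tokens" appears in both mappings.
```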
