Commit 4c2085a

fix docs; add numina tasks (#79)
- Fixes some broken links after #77. `skythought_evals` has been renamed to `evals` and the package name is `skythought`.
- Added separate yamls for Numina for a better quickstart experience. Ideally, we shouldn't have to keep adding yamls for all the training datasets in the evaluation library, and should instead provide APIs for standalone scripts. For now we do this to support reproduction of Sky-T1 models.

Signed-off-by: SumanthRH <[email protected]>
1 parent 3deec3b commit 4c2085a

12 files changed: +104 -22 lines changed

.github/workflows/cpu_ci.yml

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -26,7 +26,7 @@ jobs:
2626
cache: 'pip'
2727
- name: Install dependencies
2828
run: python -m pip install --upgrade pip setuptools wheel pre-commit
29-
- name: Install skythought_evals
29+
- name: Install skythought
3030
run: python -m pip install -e ".[dev]"
3131
- name: Run pre-commit hooks
3232
run: pre-commit run --all-files --config .pre-commit-config.yaml
@@ -46,7 +46,7 @@ jobs:
4646
cache: 'pip'
4747
- name: Install dependencies
4848
run: python -m pip install --upgrade pip setuptools wheel pre-commit pytest
49-
- name: Install skythought_evals
49+
- name: Install skythought
5050
run: python -m pip install -e ".[dev]"
5151
- name: Run tests
5252
run: python -m pytest tests/

README.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@
3737

3838
We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
3939
- [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series.
40-
- [`skythought/skythought_evals`](./skythought/skythought_evals/): Our data generation and evaluation library.
40+
- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library.
4141
- [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training.
4242
- [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.
4343

@@ -79,7 +79,7 @@ We support a wide variety of datasets in mathematics, science and coding:
7979
- GSM8K
8080
- AIME'25
8181

82-
For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/skythought_evals/README.md).
82+
For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/evals/README.md).
8383

8484

8585
### Evaluation results
@@ -112,7 +112,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi
112112
| BFCL-v3 | 53.18 | **58.92** | 17.41 | [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) |
113113
| Arena-Hard | **74.79** | 66.51 | 52.6 | [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) |
114114

115-
For more details, refer [here](./skythought/skythought_evals/base_instruct_evals.md).
115+
For more details, refer [here](./skythought/evals/base_instruct_evals.md).
116116

117117
## Fully Open-source: Driving Progress Together
118118
We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:

examples/evaluate.ipynb

Lines changed: 3 additions & 3 deletions
@@ -94,7 +94,7 @@
9494
"1. Quick Start\n",
9595
"\n",
9696
"```bash\n",
97-
"skythought_evals evaluate \\\n",
97+
"skythought evaluate \\\n",
9898
"--task aime24 \\ \n",
9999
"--model NovaSky-AI/Sky-T1-32B-Preview \\\n",
100100
"--backend vllm \\\n",
@@ -124,7 +124,7 @@
124124
"### Key Concepts\n",
125125
"\n",
126126
"- Task: A task is an evaluation dataset. We use the `task` argument to retrieve the corresponding configuration file from our pre-configured benchmarks (To see the available tasks, use `skythought evaluate --help`) \n",
127-
"- Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
127+
"- Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
128128
"- Backend: The Backend is concerned with how the LLM instance is created and queried. We support a variety of backends via the `backend` argument. \n",
129129
" - The `openai` backend can be used to query OpenAI-compatible endpoints. Example: `--backend openai --backend-args base_url=https://api.openai.com`\n",
130130
" - The `vllm` backend instantiates a local model instance with [vLLM](docs.vllm.ai) for efficient inference. \n",
@@ -154,7 +154,7 @@
154154
"cell_type": "markdown",
155155
"metadata": {},
156156
"source": [
157-
"For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/skythought_evals/README.md)"
157+
"For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/evals/README.md)"
158158
]
159159
}
160160
],

recipes/sky-t1-7b/README.md

Lines changed: 43 additions & 8 deletions
@@ -3,23 +3,58 @@ For a detailed training recipes and technical details, refer to the [blog](https
33

44
## SFT: Step 1 and Step 3 SFT
55
### Distillation Data Mixture
6-
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
6+
Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
77

88
For the distillation performed in step 1 and step 3, we use the following script for data generation. Replace the `$MODEL_NAME` with the model to be distilled from.
9+
For each subset (`numina_math`, `numina_olympiads`, etc.), the `score` command requires the output directory from the previous `generate` command.
10+
911
```shell
10-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --inference
11-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --check
12+
skythought generate \
13+
--task numina_math \
14+
--model $MODEL_NAME \
15+
--backend vllm \
16+
--backend-args tensor_parallel_size=4 \
17+
--sampling-params max_tokens=16384 \
18+
--result-dir ./data
1219

13-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --inference
14-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --check
20+
skythought score \
21+
--task numina_math \
22+
--run-dir <path to output folder from generate>
23+
```
1524

16-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --inference
17-
python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --check
25+
```shell
26+
skythought generate \
27+
--task numina_olympiads \
28+
--model $MODEL_NAME \
29+
--backend vllm \
30+
--backend-args tensor_parallel_size=4 \
31+
--sampling-params max_tokens=16384 \
32+
--end 40000 \
33+
--result-dir ./data
34+
35+
skythought score \
36+
--task numina_olympiads \
37+
--run-dir <path to output folder from generate>
1838
```
39+
40+
```shell
41+
skythought generate \
42+
--task numina_amc_aime \
43+
--model $MODEL_NAME \
44+
--backend vllm \
45+
--backend-args tensor_parallel_size=4 \
46+
--sampling-params max_tokens=16384 \
47+
--result-dir ./data
48+
49+
skythought score \
50+
--task numina_amc_aime \
51+
--run-dir <path to output folder from generate>
52+
```
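The generate-then-score pattern above can be sketched as a small Python helper that assembles the same command lines for each subset. This is only an illustration, not part of the repository: the helper names (`build_generate_cmd`, `build_score_cmd`, `plan`) are hypothetical, the flags are copied from the commands above, and it assumes each `score` call is meant to use the same task as its `generate` call. The run directory is left as a placeholder because the output folder naming is not shown here.

```python
import shlex

# Subset-specific extra flags, copied from the commands above.
# Only the olympiads subset passes --end.
SUBSETS = {
    "numina_math": [],
    "numina_olympiads": ["--end", "40000"],
    "numina_amc_aime": [],
}


def build_generate_cmd(task, model, extra, result_dir="./data"):
    """Assemble the `skythought generate` argument list for one subset."""
    return [
        "skythought", "generate",
        "--task", task,
        "--model", model,
        "--backend", "vllm",
        "--backend-args", "tensor_parallel_size=4",
        "--sampling-params", "max_tokens=16384",
        *extra,
        "--result-dir", result_dir,
    ]


def build_score_cmd(task, run_dir):
    """Assemble the `skythought score` argument list for one subset."""
    return ["skythought", "score", "--task", task, "--run-dir", run_dir]


def plan(model, run_dirs):
    """Return (generate, score) command pairs, one pair per subset.

    `run_dirs` maps each task name to the output folder produced by its
    generate step; the caller has to supply it (hypothetical input).
    """
    return [
        (build_generate_cmd(task, model, extra), build_score_cmd(task, run_dirs[task]))
        for task, extra in SUBSETS.items()
    ]


if __name__ == "__main__":
    # Placeholders stand in for the real output folders.
    placeholder_dirs = {task: f"<run dir for {task}>" for task in SUBSETS}
    for generate_cmd, score_cmd in plan("$MODEL_NAME", placeholder_dirs):
        print(shlex.join(generate_cmd))
        print(shlex.join(score_cmd))
```

Keeping the task name in a single place per subset avoids scoring one subset's run directory under another subset's task.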
53+
1954
For step 1 and step 3 SFT, follow the instructions in `skythought/train`.
2055

2156
## RL: Step 2 and Step 4 RL
2257
For RL training, install our modified fork of [VeRL](https://github.com/volcengine/verl) under `skythought/skythought-rl` and follow the instructions there, we also incorporate the math and coding testing utils from the [PRIME](https://github.com/PRIME-RL/PRIME) repo.
2358

2459
## Evaluation
25-
For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
60+
For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).

recipes/sky-t1-flash/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ For a detailed breakdown of the duration curation steps and training methodology
66

77
## Setup
88

9-
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
9+
Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
1010

1111

1212
## Stage 1: Data Generation

recipes/sky-t1-preview/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ Give below are the instructions to replicate the data preprocessing and training
66

77
## Setup
88

9-
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
9+
Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
1010
Set the env variable `SKYT_HOME` as the directory for the final dataset.
1111

1212
## Training Data Curation

scripts/response_rewrite.py

Lines changed: 3 additions & 2 deletions
@@ -3,11 +3,12 @@
33
import os
44
import random
55

6-
from skythought_evals.models import ModelConfig
7-
from skythought_evals.util.math_parsing_util import strip_answer_string
86
from tqdm import tqdm
97
from vllm import LLM, SamplingParams
108

9+
from skythought.evals.models import ModelConfig
10+
from skythought.evals.util.math_parsing_util import strip_answer_string
11+
1112
SUBPROBLEM_SPLIT_PROMPT = """
1213
You are given a reasoning sequence that attempts to solve a math problem.
1314
This sequence contains multiple proposed solutions, then provides a the final solution.

skythought/evals/README.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
33

44
## Requirements
55

6-
Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage).
6+
Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage).
77

88
For running OpenAI model, export the OpenAI key.
99
```shell
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
handler: numina
2+
dataset_path: "AI-MO/NuminaMath-CoT"
3+
dataset_subset: null
4+
dataset_split: train
5+
question_key: problem
6+
answer_key: solution
7+
templating_parameters:
8+
template: "Return your final response within \\boxed{{}}. {prompt}"
9+
preprocess_config:
10+
filter_difficulty: true
11+
math_difficulty_lower_bound: 1
12+
math_difficulty_upper_bound: 9
13+
source: amc_aime

skythought/evals/tasks/numina/numina_handler.py

Lines changed: 8 additions & 1 deletion
@@ -64,8 +64,15 @@ def get_difficulty_dict(subset, start, end):
6464
def load_and_filter_dataset(
6565
self, start, end, split=None, subset=None, difficulty=None
6666
):
67-
dataset = self.load_dataset(subset=subset, split=split).to_pandas()
67+
dataset = self.load_dataset(subset=subset, split=split)
6868

69+
if "source" in self.task_config.preprocess_config:
70+
source = self.task_config.preprocess_config["source"]
71+
dataset = dataset.filter(lambda x: x["source"] == source)
72+
73+
dataset = dataset.to_pandas()
74+
# TODO (sumanthrh): this is hacky for numina. the start and end filter should be applied at the very end
75+
# it is kept here for consistency with the original code.
6976
dataset = dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]
7077
dataset = dataset[dataset["solution"].str.contains("boxed", na=False)]
7178
