Commit 7a508ce

fix docs; add numina tasks
Signed-off-by: SumanthRH <[email protected]>
1 parent 3deec3b commit 7a508ce

12 files changed

+104
-22
lines changed

.github/workflows/cpu_ci.yml

Lines changed: 2 additions & 2 deletions
@@ -26,7 +26,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run pre-commit hooks
         run: pre-commit run --all-files --config .pre-commit-config.yaml
@@ -46,7 +46,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit pytest
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run tests
         run: python -m pytest tests/

README.md

Lines changed: 3 additions & 3 deletions
@@ -37,7 +37,7 @@

 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
 - [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series.
-- [`skythought/skythought_evals`](./skythought/skythought_evals/): Our data generation and evaluation library.
+- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library.
 - [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training.
 - [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.

@@ -79,7 +79,7 @@ We support a wide variety of datasets in mathematics, science and coding:
 - GSM8K
 - AIME'25

-For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/skythought_evals/README.md).
+For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/evals/README.md).


 ### Evaluation results
@@ -112,7 +112,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi
 | BFCL-v3 | 53.18 | **58.92** | 17.41 | [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) |
 | Arena-Hard | **74.79** | 66.51 | 52.6 | [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) |

-For more details, refer [here](./skythought/skythought_evals/base_instruct_evals.md).
+For more details, refer [here](./skythought/evals/base_instruct_evals.md).

 ## Fully Open-source: Driving Progress Together
 We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:

examples/evaluate.ipynb

Lines changed: 3 additions & 3 deletions
@@ -94,7 +94,7 @@
 "1. Quick Start\n",
 "\n",
 "```bash\n",
-"skythought_evals evaluate \\\n",
+"skythought evaluate \\\n",
 "--task aime24 \\ \n",
 "--model NovaSky-AI/Sky-T1-32B-Preview \\\n",
 "--backend vllm \\\n",
@@ -124,7 +124,7 @@
 "### Key Concepts\n",
 "\n",
 "- Task: A task is an evaluation dataset. We use the `task` argument to retrieve the corresponding configuration file from our pre-configured benchmarks (To see the available tasks, use `skythought evaluate --help`) \n",
-"- Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
+"- Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
 "- Backend: The Backend is concerned with how the LLM instance is created and queried. We support a variety of backends via the `backend` argument. \n",
 " - The `openai` backend can be used to query OpenAI-compatible endpoints. Example: `--backend openai --backend-args base_url=https://api.openai.com`\n",
 " - The `vllm` backend instantiates a local model instance with [vLLM](docs.vllm.ai) for efficient inference. \n",
@@ -154,7 +154,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/skythought_evals/README.md)"
+"For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/evals/README.md)"
 ]
 }
 ],

recipes/sky-t1-7b/README.md

Lines changed: 43 additions & 8 deletions
@@ -3,23 +3,58 @@ For a detailed training recipes and technical details, refer to the [blog](https

 ## SFT: Step 1 and Step 3 SFT
 ### Distillation Data Mixture
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.

 For the distillation performed in step 1 and step 3, we use the following script for data generation. Replace the `$MODEL_NAME` with the model to be distilled from.
+For each subset (`numina_math`, `numina_olympiads`, etc), the `score` command requires the output directory from the previous `generate` command.
+
 ```shell
-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --check
+skythought generate \
+--task numina_math \
+--model $MODEL_NAME \
+--backend vllm \
+--backend-args tensor_parallel_size=4 \
+--sampling-params max_tokens=16384 \
+--result-dir ./data

-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --check
+skythought score \
+--task numina_math \
+--run-dir <path to output folder from generate>
+```

-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --check
+```shell
+skythought generate \
+--task numina_olympiads \
+--model $MODEL_NAME \
+--backend vllm \
+--backend-args tensor_parallel_size=4 \
+--sampling-params max_tokens=16384 \
+--end 40000 \
+--result-dir ./data
+
+skythought score \
+--task numina_olympiads \
+--run-dir <path to output folder from generate>
 ```
+
+```shell
+skythought generate \
+--task numina_amc_aime \
+--model $MODEL_NAME \
+--backend vllm \
+--backend-args tensor_parallel_size=4 \
+--sampling-params max_tokens=16384 \
+--result-dir ./data
+
+skythought score \
+--task numina_amc_aime \
+--run-dir <path to output folder from generate>
+```
+
 For step 1 and step 3 SFT, follow the instructions in `skythought/train`.

 ## RL: Step 2 and Step 4 RL
 For RL training, install our modified fork of [VeRL](https://github.com/volcengine/verl) under `skythought/skythought-rl` and follow the instructions there, we also incorporate the math and coding testing utils from the [PRIME](https://github.com/PRIME-RL/PRIME) repo.

 ## Evaluation
-For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
+For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
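
Reviewer note: the recipe changes above pair each `generate` call with a `score` call on the same task name. The sketch below only assembles those command lines in Python so the pairing is explicit; it does not run anything. The helper names `generate_cmd` and `score_cmd` are hypothetical, and the flags are copied from the README diff rather than checked against the CLI.

```python
import shlex
from typing import List, Optional


def generate_cmd(task: str, model: str, end: Optional[int] = None) -> List[str]:
    """Hypothetical helper: build the `skythought generate` argv from the README."""
    cmd = [
        "skythought", "generate",
        "--task", task,
        "--model", model,
        "--backend", "vllm",
        "--backend-args", "tensor_parallel_size=4",
        "--sampling-params", "max_tokens=16384",
    ]
    if end is not None:
        # Only the olympiads subset caps the number of rows in the README.
        cmd += ["--end", str(end)]
    return cmd + ["--result-dir", "./data"]


def score_cmd(task: str, run_dir: str) -> List[str]:
    """Hypothetical helper: build the matching `skythought score` argv."""
    return ["skythought", "score", "--task", task, "--run-dir", run_dir]


# Task name -> optional --end value, as listed in the README diff.
SUBSETS = {"numina_math": None, "numina_olympiads": 40000, "numina_amc_aime": None}

commands = []
for task, end in SUBSETS.items():
    commands.append(generate_cmd(task, "$MODEL_NAME", end))
    # The run directory is a placeholder; it is produced by the generate step.
    commands.append(score_cmd(task, "<path to output folder from generate>"))

for cmd in commands:
    print(shlex.join(cmd))
```

Looping over a single task table keeps the `generate` and `score` task names from drifting apart.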

recipes/sky-t1-flash/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ For a detailed breakdown of the duration curation steps and training methodology

 ## Setup

-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.


 ## Stage 1: Data Generation

recipes/sky-t1-preview/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ Give below are the instructions to replicate the data preprocessing and training

 ## Setup

-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 Set the env variable `SKYT_HOME` as the directory for the final dataset.

 ## Training Data Curation

scripts/response_rewrite.py

Lines changed: 3 additions & 2 deletions
@@ -3,11 +3,12 @@
 import os
 import random

-from skythought_evals.models import ModelConfig
-from skythought_evals.util.math_parsing_util import strip_answer_string
 from tqdm import tqdm
 from vllm import LLM, SamplingParams

+from skythought.evals.models import ModelConfig
+from skythought.evals.util.math_parsing_util import strip_answer_string
+
 SUBPROBLEM_SPLIT_PROMPT = """
 You are given a reasoning sequence that attempts to solve a math problem.
 This sequence contains multiple proposed solutions, then provides a the final solution.

skythought/evals/README.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@

 ## Requirements

-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage).
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage).

 For running OpenAI model, export the OpenAI key.
 ```shell
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 1
+  math_difficulty_upper_bound: 9
+  source: amc_aime
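
Reviewer note: the `template` value in this config reads like a Python format string, with `{{}}` as escaped literal braces and `{prompt}` as the placeholder. A minimal sketch of how such a template would render, assuming it is filled with `str.format` (the diff does not confirm this); `render_prompt` is a hypothetical helper, not part of the library.

```python
# After YAML parsing, the double-quoted "\\boxed{{}}" is the text \boxed{{}}.
# The Python literal below spells out that same parsed string.
TEMPLATE = "Return your final response within \\boxed{{}}. {prompt}"


def render_prompt(template: str, question: str) -> str:
    """Hypothetical helper: fill the {prompt} placeholder via str.format."""
    # Doubled braces are format-string escapes, so {{}} renders as a literal {}.
    return template.format(prompt=question)


rendered = render_prompt(TEMPLATE, "What is 1 + 1?")
print(rendered)
```

If the braces were written as a single `{}`, `str.format` would treat them as a positional placeholder and fail, which is presumably why the config doubles them.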

skythought/evals/tasks/numina/numina_handler.py

Lines changed: 8 additions & 1 deletion
@@ -64,8 +64,15 @@ def get_difficulty_dict(subset, start, end):
     def load_and_filter_dataset(
         self, start, end, split=None, subset=None, difficulty=None
     ):
-        dataset = self.load_dataset(subset=subset, split=split).to_pandas()
+        dataset = self.load_dataset(subset=subset, split=split)

+        if "source" in self.task_config.preprocess_config:
+            source = self.task_config.preprocess_config["source"]
+            dataset = dataset.filter(lambda x: x["source"] == source)
+
+        dataset = dataset.to_pandas()
+        # TODO (sumanthrh): this is hacky for numina. the start and end filter should be applied at the very end
+        # it is kept here for consistency with the original code.
         dataset = dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]
         dataset = dataset[dataset["solution"].str.contains("boxed", na=False)]

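
Reviewer note: the order of operations in this hunk matters, since the `start`/`end` slice runs after the source filter but before the `boxed` filter. The sketch below mirrors that order on plain Python lists instead of the `datasets`/pandas objects the handler uses; `filter_rows` and the sample rows are made up for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def filter_rows(
    rows: List[Row], start: int, end: int, source: Optional[str] = None
) -> List[Row]:
    """Hypothetical stand-in for the handler's filtering order."""
    # 1. Optional source filter, applied first (the step added in this commit).
    if source is not None:
        rows = [r for r in rows if r["source"] == source]
    # 2. start/end slice; end <= 0 means no upper bound.
    rows = rows[start:end] if end > 0 else rows[start:]
    # 3. Keep only rows whose solution mentions "boxed"; missing values are dropped.
    return [r for r in rows if "boxed" in (r.get("solution") or "")]


rows = [
    {"source": "olympiads", "solution": "\\boxed{1}"},
    {"source": "amc_aime", "solution": "\\boxed{2}"},
    {"source": "amc_aime", "solution": "no final answer"},
    {"source": "amc_aime", "solution": "\\boxed{3}"},
]

kept = filter_rows(rows, start=0, end=2, source="amc_aime")
print(len(kept))
```

Because the slice happens before the `boxed` filter, asking for two rows here yields only one, which is the behavior the TODO in the hunk flags.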
