fix docs; add numina tasks

SumanthRH · SumanthRH · commit 7a508ce6df96 · 2025-02-20T23:40:56.000-08:00
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
diff --git a/.github/workflows/cpu_ci.yml b/.github/workflows/cpu_ci.yml
@@ -26,7 +26,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run pre-commit hooks
         run: pre-commit run --all-files --config .pre-commit-config.yaml
@@ -46,7 +46,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit pytest
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run tests
         run: python -m pytest tests/
diff --git a/README.md b/README.md
@@ -37,7 +37,7 @@
 
 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
 - [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series. 
-- [`skythought/skythought_evals`](./skythought/skythought_evals/): Our data generation and evaluation library. 
+- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. 
 - [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. 
 - [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.
 
@@ -79,7 +79,7 @@ We support a wide variety of datasets in mathematics, science and coding:
 - GSM8K
 - AIME'25
 
-For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/skythought_evals/README.md).
+For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/evals/README.md).
 
 
 ### Evaluation results
@@ -112,7 +112,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi
 | BFCL-v3 | 53.18 | **58.92** | 17.41 | [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) |
 | Arena-Hard | **74.79** | 66.51 | 52.6 | [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) |
 
-For more details, refer [here](./skythought/skythought_evals/base_instruct_evals.md).
+For more details, refer [here](./skythought/evals/base_instruct_evals.md).
 
 ## Fully Open-source: Driving Progress Together
 We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:
diff --git a/examples/evaluate.ipynb b/examples/evaluate.ipynb
@@ -94,7 +94,7 @@
     "1. Quick Start\n",
     "\n",
     "```bash\n",
-    "skythought_evals evaluate \\\n",
+    "skythought evaluate \\\n",
     "--task aime24 \\ \n",
     "--model  NovaSky-AI/Sky-T1-32B-Preview \\\n",
     "--backend vllm \\\n",
@@ -124,7 +124,7 @@
     "### Key Concepts\n",
     "\n",
     "-  Task: A task is an evaluation dataset. We use the `task` argument to retrieve the corresponding configuration file from our pre-configured benchmarks (To see the available tasks, use `skythought evaluate --help`) \n",
-    "-  Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
+    "-  Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
     "- Backend: The Backend is concerned with how the LLM instance is created and queried. We support a variety of backends via the `backend` argument. \n",
     "    - The `openai` backend can be used to query OpenAI-compatible endpoints. Example: `--backend openai --backend-args base_url=https://api.openai.com`\n",
     "    - The `vllm` backend instantiates a local model instance with [vLLM](docs.vllm.ai) for efficient inference. \n",
@@ -154,7 +154,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/skythought_evals/README.md)"
+    "For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/evals/README.md)"
    ]
   }
  ],
diff --git a/recipes/sky-t1-7b/README.md b/recipes/sky-t1-7b/README.md
@@ -3,23 +3,58 @@ For a detailed training recipes and technical details, refer to the [blog](https
 
 ## SFT: Step 1 and Step 3 SFT
 ### Distillation Data Mixture
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 
 For the distillation performed in step 1 and step 3, we use the following script for data generation. Replace the `$MODEL_NAME` with the model to be distilled from.
+For each subset (`numina_math`, `numina_olympiads`, etc.), the `score` command requires the output directory from the previous `generate` command.
+
 ```shell
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --check
+skythought generate \
+    --task numina_math \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --result-dir ./data
 
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --check
+skythought score \
+    --task numina_math \
+    --run-dir <path to output folder from generate>
+```
 
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --check
+```shell
+skythought generate  \
+    --task numina_olympiads \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --end 40000 \
+    --result-dir ./data 
+
+skythought score \
+    --task numina_olympiads \
+    --run-dir <path to output folder from generate>
 ```
+
+```shell
+skythought generate  \
+    --task numina_amc_aime \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --result-dir ./data 
+
+skythought score \
+    --task numina_amc_aime \
+    --run-dir <path to output folder from generate>
+```
+
 For step 1 and step 3 SFT, follow the instructions in `skythought/train`.
 
 ## RL: Step 2 and Step 4 RL
 For RL training, install our modified fork of [VeRL](https://github.com/volcengine/verl) under `skythought/skythought-rl` and follow the instructions there, we also incorporate the math and coding testing utils from the [PRIME](https://github.com/PRIME-RL/PRIME) repo.
 
 ## Evaluation
-For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
+For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
diff --git a/recipes/sky-t1-flash/README.md b/recipes/sky-t1-flash/README.md
@@ -6,7 +6,7 @@ For a detailed breakdown of the duration curation steps and training methodology
 
 ## Setup
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 
 
 ## Stage 1: Data Generation
diff --git a/recipes/sky-t1-preview/README.md b/recipes/sky-t1-preview/README.md
@@ -6,7 +6,7 @@ Give below are the instructions to replicate the data preprocessing and training
 
 ## Setup
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 Set the env variable `SKYT_HOME` as the directory for the final dataset. 
 
 ## Training Data Curation
diff --git a/scripts/response_rewrite.py b/scripts/response_rewrite.py
@@ -3,11 +3,12 @@
 import os
 import random
 
-from skythought_evals.models import ModelConfig
-from skythought_evals.util.math_parsing_util import strip_answer_string
 from tqdm import tqdm
 from vllm import LLM, SamplingParams
 
+from skythought.evals.models import ModelConfig
+from skythought.evals.util.math_parsing_util import strip_answer_string
+
 SUBPROBLEM_SPLIT_PROMPT = """
   You are given a reasoning sequence that attempts to solve a math problem.
   This sequence contains multiple proposed solutions, then provides a the final solution. 
diff --git a/skythought/evals/README.md b/skythought/evals/README.md
@@ -3,7 +3,7 @@
 
 ## Requirements 
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage).
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage).
 
 For running OpenAI model, export the OpenAI key. 
 ```shell
diff --git a/skythought/evals/tasks/numina/numina_amc_aime.yaml b/skythought/evals/tasks/numina/numina_amc_aime.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 1
+  math_difficulty_upper_bound: 9
+  source: amc_aime
diff --git a/skythought/evals/tasks/numina/numina_handler.py b/skythought/evals/tasks/numina/numina_handler.py
@@ -64,8 +64,15 @@ def get_difficulty_dict(subset, start, end):
     def load_and_filter_dataset(
         self, start, end, split=None, subset=None, difficulty=None
     ):
-        dataset = self.load_dataset(subset=subset, split=split).to_pandas()
+        dataset = self.load_dataset(subset=subset, split=split)
 
+        if "source" in self.task_config.preprocess_config:
+            source = self.task_config.preprocess_config["source"]
+            dataset = dataset.filter(lambda x: x["source"] == source)
+
+        dataset = dataset.to_pandas()
+        # TODO (sumanthrh): this is hacky for numina. the start and end filter should be applied at the very end
+        # it is kept here for consistency with the original code.
         dataset = dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]
         dataset = dataset[dataset["solution"].str.contains("boxed", na=False)]
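Review note: the hunk above fixes the order of operations as source filter, then `start`/`end` window, then the `boxed` check. The sketch below mimics that order on plain Python dicts so the effect is easy to see. It is only an illustration: `filter_then_slice` and the sample rows are hypothetical, and the real handler works on a Hugging Face dataset converted to pandas.

```python
from typing import Dict, List, Optional


def filter_then_slice(
    rows: List[Dict[str, str]],
    start: int,
    end: int,
    source: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Hypothetical stand-in for the ordering used in load_and_filter_dataset."""
    # 1. Keep only rows from the configured source, if one is given.
    if source is not None:
        rows = [r for r in rows if r["source"] == source]
    # 2. Apply the start/end window; end <= 0 means "up to the last row".
    rows = rows[start:end] if end > 0 else rows[start:]
    # 3. Keep only rows whose solution mentions a boxed answer.
    return [r for r in rows if "boxed" in r["solution"]]


# Made-up sample rows, not taken from NuminaMath-CoT.
rows = [
    {"source": "math", "solution": "\\boxed{1}"},
    {"source": "olympiads", "solution": "\\boxed{2}"},
    {"source": "olympiads", "solution": "no final answer"},
    {"source": "olympiads", "solution": "\\boxed{3}"},
]

selected = filter_then_slice(rows, start=0, end=2, source="olympiads")
print([r["solution"] for r in selected])
```

Because the window is applied before the `boxed` check, `--end` counts rows of the chosen source, including rows that are dropped afterwards for lacking a boxed answer. This matches the TODO in the hunk, which notes that the window should ideally be applied last.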
 
diff --git a/skythought/evals/tasks/numina/numina_math.yaml b/skythought/evals/tasks/numina/numina_math.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 4
+  math_difficulty_upper_bound: 9
+  source: math
diff --git a/skythought/evals/tasks/numina/numina_olympiads.yaml b/skythought/evals/tasks/numina/numina_olympiads.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 9
+  math_difficulty_upper_bound: 9
+  source: olympiads
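
Review note: all three task files share the same `template` string, and its escaping is easy to misread. The sketch below shows how the string would render, assuming the template is filled in with Python's `str.format` (an assumption, as the diff does not show the rendering code). The `render` helper is hypothetical.

```python
# The YAML double-quoted scalar "\\boxed{{}}" decodes to the characters \boxed{{}},
# which is the same string as this Python literal.
template = "Return your final response within \\boxed{{}}. {prompt}"


def render(template: str, prompt: str) -> str:
    """Hypothetical helper; assumes the template is applied with str.format."""
    # str.format turns the doubled braces "{{}}" into a literal "{}".
    return template.format(prompt=prompt)


print(render(template, "What is 2 + 2?"))
# prints: Return your final response within \boxed{}. What is 2 + 2?
```

Under that assumption, the doubled braces are required: a single `{}` would be read as a positional format field rather than as the literal braces of `\boxed{}`.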