fix docs; add numina tasks (#79)

SumanthRH · web-flow · commit 4c2085a0428d · 2025-02-21T00:02:30.000-08:00
- Fixes some broken links after #77. `skythought_evals` has been renamed to `evals` and the package name is `skythought`.
- Added separate yamls for Numina for a better quickstart experience. Ideally, we shouldn't have to keep adding yamls for all the training datasets in the evaluation library, and should instead provide APIs for standalone scripts. For now we do this to support reproduction of Sky-T1 models.

Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
diff --git a/.github/workflows/cpu_ci.yml b/.github/workflows/cpu_ci.yml
@@ -26,7 +26,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run pre-commit hooks
         run: pre-commit run --all-files --config .pre-commit-config.yaml
@@ -46,7 +46,7 @@ jobs:
           cache: 'pip'
       - name: Install dependencies
         run: python -m pip install --upgrade pip setuptools wheel pre-commit pytest
-      - name: Install skythought_evals
+      - name: Install skythought
         run: python -m pip install -e ".[dev]"
       - name: Run tests
         run: python -m pytest tests/
diff --git a/README.md b/README.md
@@ -37,7 +37,7 @@
 
 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview; you can find more details in each directory.
 - [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series. 
-- [`skythought/skythought_evals`](./skythought/skythought_evals/): Our data generation and evaluation library. 
+- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. 
 - [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. 
 - [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.
 
@@ -79,7 +79,7 @@ We support a wide variety of datasets in mathematics, science and coding:
 - GSM8K
 - AIME'25
 
-For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/skythought_evals/README.md).
+For more details, please refer to our [evaluation guide](examples/evaluate.ipynb) and the [README](skythought/evals/README.md).
 
 
 ### Evaluation results
@@ -112,7 +112,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi
 | BFCL-v3 | 53.18 | **58.92** | 17.41 | [BFCL](https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard) |
 | Arena-Hard | **74.79** | 66.51 | 52.6 | [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) |
 
-For more details, refer [here](./skythought/skythought_evals/base_instruct_evals.md).
+For more details, refer [here](./skythought/evals/base_instruct_evals.md).
 
 ## Fully Open-source: Driving Progress Together
 We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:
diff --git a/examples/evaluate.ipynb b/examples/evaluate.ipynb
@@ -94,7 +94,7 @@
     "1. Quick Start\n",
     "\n",
     "```bash\n",
-    "skythought_evals evaluate \\\n",
+    "skythought evaluate \\\n",
     "--task aime24 \\\n",
     "--model  NovaSky-AI/Sky-T1-32B-Preview \\\n",
     "--backend vllm \\\n",
@@ -124,7 +124,7 @@
     "### Key Concepts\n",
     "\n",
     "-  Task: A task is an evaluation dataset. We use the `task` argument to retrieve the corresponding configuration file from our pre-configured benchmarks (To see the available tasks, use `skythought evaluate --help`) \n",
-    "-  Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
+    "-  Model: A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. We use the `model` argument to retrieve pre-configured templating parameters (system prompt, assistant prefill, etc) for the model, if available. You can find the list of pre-configured models [here](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml). If a model is not available, then we use no system prompt (i.e we default to the system prompt in the chat template, if specified). You can use the `--system-prompt-name` flag to use one of the pre-configured system prompts in the library. To see the available system prompts, use `skythought evaluate --help`. You can also pass the full system prompt via CLI with the `--system-prompt` option. \n",
     "- Backend: The Backend is concerned with how the LLM instance is created and queried. We support a variety of backends via the `backend` argument. \n",
     "    - The `openai` backend can be used to query OpenAI-compatible endpoints. Example: `--backend openai --backend-args base_url=https://api.openai.com`\n",
     "    - The `vllm` backend instantiates a local model instance with [vLLM](https://docs.vllm.ai) for efficient inference. \n",
@@ -154,7 +154,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/skythought_evals/README.md)"
+    "For more details - such as which configurations are best for performance, how to perform multi-node inference, etc refer to the [README](../skythought/evals/README.md)"
    ]
   }
  ],
diff --git a/recipes/sky-t1-7b/README.md b/recipes/sky-t1-7b/README.md
@@ -3,23 +3,58 @@ For a detailed training recipes and technical details, refer to the [blog](https
 
 ## SFT: Step 1 and Step 3 SFT
 ### Distillation Data Mixture
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 
 For the distillation performed in step 1 and step 3, we use the following script for data generation. Replace the `$MODEL_NAME` with the model to be distilled from.
+For each subset (`numina_math`, `numina_olympiads`, etc.), the `score` command requires the output directory from the previous `generate` command.
+
 ```shell
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source math --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 4 --math-difficulty-upper-bound 9 --check
+skythought generate \
+    --task numina_math \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --result-dir ./data
 
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source olympiads --end 40000 --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 9 --math-difficulty-upper-bound 9 --check
+skythought score \
+    --task numina_math \
+    --run-dir <path to output folder from generate>
+```
 
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --inference
-python -m skythought_evals.inference_and_check  --dataset NUMINA --model $MODEL_NAME --tp 4 --max_tokens 16384 --split train --source amc_aime --filter-difficulty --result-dir ./data --math-difficulty-lower-bound 1 --math-difficulty-upper-bound 9 --check
+```shell
+skythought generate  \
+    --task numina_olympiads \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --end 40000 \
+    --result-dir ./data 
+
+skythought score \
+    --task numina_olympiads \
+    --run-dir <path to output folder from generate>
 ```
+
+```shell
+skythought generate  \
+    --task numina_amc_aime \
+    --model $MODEL_NAME \
+    --backend vllm \
+    --backend-args tensor_parallel_size=4 \
+    --sampling-params max_tokens=16384 \
+    --result-dir ./data 
+
+skythought score \
+    --task numina_amc_aime \
+    --run-dir <path to output folder from generate>
+```
+
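The three generate/score pairs above differ only in the task name and the optional `--end` value, so they can be assembled programmatically. The following is a minimal sketch, assuming only the flags shown in the commands above; the helper names `build_generate_cmd`, `build_score_cmd` and the `SUBSETS` table are hypothetical and not part of the `skythought` package. It only builds argument lists and does not launch anything.

```python
# Hypothetical helpers: they only assemble the argv lists shown in this README.
# No flags beyond those used in the commands above are assumed.
SUBSETS = {
    "numina_math": None,
    "numina_olympiads": 40000,  # only this subset passes --end
    "numina_amc_aime": None,
}


def build_generate_cmd(task, model, end=None, result_dir="./data"):
    """Return the argv list for `skythought generate` for one Numina subset."""
    cmd = [
        "skythought", "generate",
        "--task", task,
        "--model", model,
        "--backend", "vllm",
        "--backend-args", "tensor_parallel_size=4",
        "--sampling-params", "max_tokens=16384",
    ]
    if end is not None:
        cmd += ["--end", str(end)]
    cmd += ["--result-dir", result_dir]
    return cmd


def build_score_cmd(task, run_dir):
    """Return the argv list for `skythought score`.

    run_dir must be the output folder written by the matching generate run.
    """
    return ["skythought", "score", "--task", task, "--run-dir", run_dir]


if __name__ == "__main__":
    for task, end in SUBSETS.items():
        # $MODEL_NAME is left as a literal placeholder for the shell to expand.
        print(" ".join(build_generate_cmd(task, "$MODEL_NAME", end)))
```

Each returned list could be passed to `subprocess.run` once the run directory from the corresponding `generate` call is known; the `score` task name should always match the task used for `generate`.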
 For step 1 and step 3 SFT, follow the instructions in `skythought/train`.
 
 ## RL: Step 2 and Step 4 RL
 For RL training, install our modified fork of [VeRL](https://github.com/volcengine/verl) under `skythought/skythought-rl` and follow the instructions there. We also incorporate the math and coding testing utils from the [PRIME](https://github.com/PRIME-RL/PRIME) repo.
 
 ## Evaluation
-For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/skythought_evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
+For evaluation, we use the [script](https://github.com/QwenLM/Qwen2.5-Math/blob/main/evaluation/sh/eval.sh) from the [Qwen math eval suite](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation). We use vLLM version `0.6.2` for all evaluations. For AMC and AIME, we use temp=0.6, top_p=0.95 and n_sample=8. After the generation, we calculate the pass@1 using this [script](https://github.com/NovaSky-AI/SkyThought/tree/main/scripts/qwen_eval_bon.py). For MATH500 and OlympiadBench, we use greedy decoding. We use the [skythought system prompt](https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/evals/models/model_configs.yaml) when evaluating all the models trained by us except for Sky-T1-mini which is evaluated [without a system prompt](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B#usage-recommendations).
diff --git a/recipes/sky-t1-flash/README.md b/recipes/sky-t1-flash/README.md
@@ -6,7 +6,7 @@ For a detailed breakdown of the duration curation steps and training methodology
 
 ## Setup
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 
 
 ## Stage 1: Data Generation
diff --git a/recipes/sky-t1-preview/README.md b/recipes/sky-t1-preview/README.md
@@ -6,7 +6,7 @@ Give below are the instructions to replicate the data preprocessing and training
 
 ## Setup
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage). All the data curation commands are provided from the root directory of the repo.
 Set the env variable `SKYT_HOME` as the directory for the final dataset. 
 
 ## Training Data Curation
diff --git a/scripts/response_rewrite.py b/scripts/response_rewrite.py
@@ -3,11 +3,12 @@
 import os
 import random
 
-from skythought_evals.models import ModelConfig
-from skythought_evals.util.math_parsing_util import strip_answer_string
 from tqdm import tqdm
 from vllm import LLM, SamplingParams
 
+from skythought.evals.models import ModelConfig
+from skythought.evals.util.math_parsing_util import strip_answer_string
+
 SUBPROBLEM_SPLIT_PROMPT = """
   You are given a reasoning sequence that attempts to solve a math problem.
   This sequence contains multiple proposed solutions, then provides the final solution.
diff --git a/skythought/evals/README.md b/skythought/evals/README.md
@@ -3,7 +3,7 @@
 
 ## Requirements 
 
-Make sure you have installed the `skythought-evals` package as outlined in the [README.md](/README.md#usage).
+Make sure you have installed the `skythought` package as outlined in the [README.md](/README.md#usage).
 
 For running OpenAI models, export the OpenAI key.
 ```shell
diff --git a/skythought/evals/tasks/numina/numina_amc_aime.yaml b/skythought/evals/tasks/numina/numina_amc_aime.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 1
+  math_difficulty_upper_bound: 9
+  source: amc_aime
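The doubled braces in the `template` value above are easy to misread. As a minimal sketch, assuming the template is expanded with Python `str.format`-style substitution (the diff does not show the rendering code, and `render` is a hypothetical helper), `{{}}` collapses to a literal `{}` while `{prompt}` is replaced by the question text:

```python
# The YAML double-quoted scalar "\\boxed{{}}" parses to the text \boxed{{}},
# which is the same text this Python literal produces.
TEMPLATE = "Return your final response within \\boxed{{}}. {prompt}"


def render(template: str, prompt: str) -> str:
    # Assumption: the task template is filled in via str.format.
    return template.format(prompt=prompt)


print(render(TEMPLATE, "What is 1 + 1?"))
# prints: Return your final response within \boxed{}. What is 1 + 1?
```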
diff --git a/skythought/evals/tasks/numina/numina_handler.py b/skythought/evals/tasks/numina/numina_handler.py
@@ -64,8 +64,15 @@ def get_difficulty_dict(subset, start, end):
     def load_and_filter_dataset(
         self, start, end, split=None, subset=None, difficulty=None
     ):
-        dataset = self.load_dataset(subset=subset, split=split).to_pandas()
+        dataset = self.load_dataset(subset=subset, split=split)
 
+        if "source" in self.task_config.preprocess_config:
+            source = self.task_config.preprocess_config["source"]
+            dataset = dataset.filter(lambda x: x["source"] == source)
+
+        dataset = dataset.to_pandas()
+        # TODO (sumanthrh): this is hacky for numina. the start and end filter should be applied at the very end
+        # it is kept here for consistency with the original code.
         dataset = dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]
         dataset = dataset[dataset["solution"].str.contains("boxed", na=False)]
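The order of operations in this hunk matters: the `source` filter runs first, so `start` and `end` index into the source-filtered rows rather than the full dataset, and the `boxed` check runs last. The sketch below mirrors that order with plain Python lists instead of `datasets` and pandas; `filter_then_slice` and the toy rows are hypothetical and only illustrate the ordering, not the handler itself.

```python
def filter_then_slice(rows, source, start, end):
    """Plain-Python analogue of the order used in load_and_filter_dataset."""
    # 1. Keep only rows from the requested source.
    rows = [r for r in rows if r["source"] == source]
    # 2. Apply start/end to the filtered rows; a non-positive end means "no end".
    rows = rows[start:end] if end > 0 else rows[start:]
    # 3. Keep only rows whose solution mentions "boxed" (missing values dropped).
    return [r for r in rows if "boxed" in (r["solution"] or "")]


# Toy data, made up for illustration.
rows = [
    {"source": "math", "solution": "\\boxed{1}"},
    {"source": "olympiads", "solution": "\\boxed{2}"},
    {"source": "math", "solution": "no final answer"},
    {"source": "math", "solution": "\\boxed{3}"},
    {"source": "olympiads", "solution": None},
]

# The slice 0:2 covers the first two "math" rows; one of them lacks "boxed".
print(len(filter_then_slice(rows, "math", 0, 2)))   # prints 1
print(len(filter_then_slice(rows, "math", 0, -1)))  # prints 2
```

Because the slice is applied before the `boxed` check, the number of rows returned can be smaller than `end - start`, which is the behavior the TODO comment in the hunk refers to.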
 
diff --git a/skythought/evals/tasks/numina/numina_math.yaml b/skythought/evals/tasks/numina/numina_math.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 4
+  math_difficulty_upper_bound: 9
+  source: math
diff --git a/skythought/evals/tasks/numina/numina_olympiads.yaml b/skythought/evals/tasks/numina/numina_olympiads.yaml
@@ -0,0 +1,13 @@
+handler: numina
+dataset_path: "AI-MO/NuminaMath-CoT"
+dataset_subset: null
+dataset_split: train
+question_key: problem
+answer_key: solution
+templating_parameters:
+  template: "Return your final response within \\boxed{{}}. {prompt}"
+preprocess_config:
+  filter_difficulty: true
+  math_difficulty_lower_bound: 9
+  math_difficulty_upper_bound: 9
+  source: olympiads