NovaSky-AI · SumanthRH · Mar 20, 2025 · Feb 24, 2025 · Feb 25, 2025 · Feb 26, 2025
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -5,12 +5,12 @@ repos:
       - id: ruff
         args: [ --fix, --exit-non-zero-on-fix ]
         # NOTE (sumanthrh): Many of the files excluded here are used for validating code generation, and linters do not recognize some of the logic in these files. skythought/train is excluded for now because it's a fork of Llamafactory
-        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|tasks/taco/taco_util\.py|tasks/apps/apps_util\.py|scripts/prompts\.py|skythought/test-time-scaling/.*)$
+        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|pyext2\.py|taco_util\.py|apps_util\.py|scripts/prompts\.py|skythought/test-time-scaling/.*)$
 
 
   # Black needs to be ran after ruff with --fix
   - repo: https://github.com/psf/black
     rev: 24.10.0
     hooks:
       - id: black
-        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|skythought/test-time-scaling/.*)$
+        exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|pyext2\.py|skythought/test-time-scaling/.*)$
diff --git a/README.md b/README.md
@@ -38,7 +38,7 @@
 
 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
 - [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series. 
-- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. 
+- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a `Scorer` API for scoring during data curation and training ([example](./examples/scoring.ipynb)). 
 - [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. 
 - [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.
 
@@ -125,7 +125,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi
 
 For more details, refer [here](./skythought/evals/base_instruct_evals.md).
 
-## Fully Open-source: Driving Progress Together
+# Fully Open-source: Driving Progress Together
 We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:
 
 <table>

diff --git a/examples/scoring.ipynb b/examples/scoring.ipynb
@@ -0,0 +1,204 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Skythought Scoring: Unified APIs for data curation, training and evaluation\n",
+    "\n",
+    "This notebook gives a quick overview of the `Scorer` API in Skythought. A `Scorer` is a lightweight class that scores model responses for a given task. Skythought provides a set of pre-defined scoring functions for verifiable domains (math, coding, etc.), making it easy to use consistent scoring across curation, training and evaluation. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Installation and Setup\n",
+    "\n",
+    "First, make sure you've installed the latest changes from source:\n",
+    "\n",
+    "#### Installing from source\n",
+    "\n",
+    "\n",
+    "```shell\n",
+    "# Clone the repository\n",
+    "git clone https://github.com/NovaSky-AI/SkyThought.git\n",
+    "cd SkyThought\n",
+    "\n",
+    "# Create and activate a virtual environment (using uv here)\n",
+    "uv venv --python 3.10\n",
+    "source .venv/bin/activate\n",
+    "\n",
+    "# Install the package in editable mode\n",
+    "uv pip install -e .\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example Usage during Data Curation\n",
+    "\n",
+    "Here's an example recipe for data curation:\n",
+    "\n",
+    "1. Create a dataset combining the “hard” subset of NUMINA and the GSM8K dataset. \n",
+    "2. Perform rejection sampling with the base model.  \n",
+    "    a. Obtain a response for each sample and filter out the incorrect responses.   \n",
+    "    b. For scoring, we will combine two functions: a correctness check for math responses like math verify along with a format scorer to make sure the model is adhering to instructions.   \n",
+    "\n",
+    "\n",
+    "```python\n",
+    "import ray\n",
+    "from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig\n",
+    "from datasets import load_dataset\n",
+    "from skythought.evals.scoring import Scorer, MathEqualScorer\n",
+    "import re\n",
+    "import os \n",
+    "\n",
+    "SYSTEM_PROMPT = \"Think step-by-step and provide the final answer in \\\\boxed{}\"\n",
+    "MAX_TOKENS = 2048 \n",
+    "\n",
+    "class FormatScorer(Scorer):\n",
+    "    SCORE_COLUMN = \"format_score\"\n",
+    "    def __init__(self, response_column):\n",
+    "        self.response_column = response_column\n",
+    "\n",
+    "    def score(self, row):\n",
+    "        pat1 = r\"<think>(.*)</think>\"\n",
+    "        pat2 = r\"\\\\boxed\\{(.*)\\}\"\n",
+    "        text = row[self.response_column]\n",
+    "        match1 = re.search(pat1, text, re.DOTALL)\n",
+    "        match2 = re.search(pat2, text, re.DOTALL)\n",
+    "        # the format check passes only if both patterns are found\n",
+    "        passed = True\n",
+    "        if not match1 or not match2:\n",
+    "            passed = False\n",
+    "        return {self.SCORE_COLUMN: passed}\n",
+    "\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "\n",
+    "    # limit the number of samples per dataset for testing\n",
+    "    num_samples = 20\n",
+    "\n",
+    "    save_dir = \"my_results_dir\"\n",
+    "    \n",
+    "    numina_hf = load_dataset(\"AI-MO/NuminaMath-CoT\", split=\"train\")\n",
+    "    gsm8k_hf = load_dataset(\"openai/gsm8k\", \"main\", split=\"train\")\n",
+    "    \n",
+    "    # filter hard problems and rename to match GSM8K's format\n",
+    "    ds1 = ray.data.from_huggingface(numina_hf) \\\n",
+    "        .filter(expr=\"source == 'hard'\")\\\n",
+    "        .rename_columns({\"problem\": \"question\", \"solution\": \"answer\"}) \\\n",
+    "        .drop_columns([\"source\"]).limit(num_samples)\n",
+    "\n",
+    "    ds2 = ray.data.from_huggingface(gsm8k_hf).limit(num_samples)\n",
+    "\n",
+    "    ds = ds1.union(ds2)\n",
+    "\n",
+    "    llm = build_llm_processor(\n",
+    "        vLLMEngineProcessorConfig(\n",
+    "            model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+    "            engine_kwargs=dict(\n",
+    "                tensor_parallel_size=2\n",
+    "            ),\n",
+    "            batch_size=64,\n",
+    "            concurrency=2,\n",
+    "        ),\n",
+    "        preprocess=lambda row: dict(\n",
+    "            messages=[\n",
+    "                {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "                {\"role\": \"user\", \"content\": row[\"question\"]},\n",
+    "            ],\n",
+    "            sampling_params=dict(\n",
+    "                temperature=0,\n",
+    "                max_tokens=MAX_TOKENS,\n",
+    "            ),\n",
+    "        )\n",
+    "    )\n",
+    "    # generates responses and saves it in \"generated_text\" column\n",
+    "    ds = llm(ds)\n",
+    "\n",
+    "    ds = ds.map(\n",
+    "        MathEqualScorer, \n",
+    "        fn_constructor_kwargs=dict(\n",
+    "            response_column=\"generated_text\", answer_column=\"answer\"\n",
+    "        ),\n",
+    "        concurrency=5\n",
+    "    )\n",
+    "\n",
+    "    ds = ds.map(\n",
+    "        FormatScorer, \n",
+    "        fn_constructor_kwargs= dict(\n",
+    "            response_column=\"generated_text\"\n",
+    "        ),\n",
+    "        concurrency=5\n",
+    "    )\n",
+    "\n",
+    "    ds = ds.filter(expr=\"math_equal_score and format_score\")\n",
+    "    \n",
+    "    ds.write_parquet(os.path.abspath(save_dir))\n",
+    "\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Example Usage During Training\n",
+    "\n",
+    "Below is an example of a custom scorer for training on the dataset used in TULU-3's RLVR stage (a mix of GSM8K, IFEval and MATH)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```python\n",
+    "...\n",
+    "from skythought.evals.scoring import MathVerifyScorer, GSM8KScorer, IFEvalScorer, Scorer\n",
+    "\n",
+    "# Custom Scoring function for a mix of GSM8K, MATH and IFEval \n",
+    "class MyScorer(Scorer):\n",
+    "    SCORE_COLUMN = \"score\"\n",
+    "    def __init__(self, source_column, response_column, output_column):\n",
+    "        self.source_column = source_column\n",
+    "        self.response_column = response_column\n",
+    "        self.output_column = output_column\n",
+    "        self.gsm8k = GSM8KScorer(response_column, output_column)\n",
+    "        self.ifeval = IFEvalScorer(response_column, output_column)\n",
+    "        self.math = MathVerifyScorer(response_column, output_column)\n",
+    "\n",
+    "    def score(self, row):\n",
+    "        source = row[self.source_column]\n",
+    "        if source == \"gsm8k\": \n",
+    "            return {self.SCORE_COLUMN: self.gsm8k(row)}\n",
+    "        elif source == \"math\": \n",
+    "            return {self.SCORE_COLUMN: self.math(row)}\n",
+    "        elif source == \"ifeval\":\n",
+    "            return {self.SCORE_COLUMN: self.ifeval(row)}\n",
+    "        else:\n",
+    "            raise ValueError(f\"Unknown source: {source}\")\n",
+    "\n",
+    "def main(args):\n",
+    "    dataset_args, training_args = parse_args(args)\n",
+    "    ...\n",
+    "    train_dataset = prepare_dataset(train_dataset, tokenizer)\n",
+    "    eval_dataset = prepare_dataset(eval_dataset, tokenizer)\n",
+    "    # assume that the trainer will provide inputs as a single dict. if not, you can customize the interface for the scorer\n",
+    "    # you can use `.score` or the __call__ interface to get the scores\n",
+    "    reward_function = MyScorer(\"id\", \"response\", \"ground_truth\")\n",
+    "```"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
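
The format check performed by `FormatScorer.score` in the notebook can be tried on its own, without Ray or `skythought`. A minimal sketch, assuming responses wrap their reasoning in `<think>` tags and the final answer in `\boxed{}`; `passes_format` is a hypothetical helper introduced here for illustration and is not part of the `skythought` API:

```python
import re

# re.DOTALL lets `.` match newlines, so multi-line reasoning is still detected.
THINK_PATTERN = re.compile(r"<think>(.*)</think>", re.DOTALL)
# Raw string with escaped backslash and braces to match a literal \boxed{...}.
BOXED_PATTERN = re.compile(r"\\boxed\{(.*)\}", re.DOTALL)


def passes_format(text: str) -> bool:
    """Return True only if both a reasoning block and a boxed answer are present."""
    has_think = THINK_PATTERN.search(text) is not None
    has_boxed = BOXED_PATTERN.search(text) is not None
    return has_think and has_boxed


print(passes_format("<think>2 + 2 = 4</think> The answer is \\boxed{4}"))
print(passes_format("The answer is \\boxed{4}"))
```

Writing the patterns as raw strings avoids Python reading `\b` as a backspace character, which would stop the `\boxed` pattern from ever matching.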
diff --git a/pyproject.toml b/pyproject.toml
@@ -8,7 +8,6 @@ authors = [
 requires-python = ">=3.9,<3.11"
 dependencies = [
     "vllm==0.7.0",
-    "pyext",
     "word2number",
     "scipy",
     "datasets",
@@ -45,7 +44,7 @@ skythought = ["evals/**/*.yaml", "evals/**/*.yml"]
 skythought = "skythought.evals.cli:main"
 
 [project.optional-dependencies]
-dev = ["pytest", "pytest-mock", "black", "ruff", "pre-commit"]
+dev = ["pytest", "pytest-mock", "pytest-asyncio", "black", "ruff", "pre-commit"]
 
 [tool.ruff]
 line-length = 160

diff --git a/recipes/sky-t1-preview/__init__.py b/recipes/sky-t1-preview/__init__.py
diff --git a/recipes/sky-t1-preview/postprocess.py b/recipes/sky-t1-preview/postprocess.py
@@ -0,0 +1,43 @@
+from typing import Any, Dict
+
+STILL2_SYSTEM_PROMPT = "Your role as an assistant involves thoroughly exploring questions through a systematic long \
+thinking process before providing the final precise and accurate solutions. This requires \
+engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, \
+backtracing, and iteration to develop well-considered thinking process. \
+Please structure your response into two main sections: Thought and Solution. \
+In the Thought section, detail your reasoning process using the specified format: \
+<|begin_of_thought|> {thought with steps separated with '\n\n'} \
+<|end_of_thought|> \
+Each step should include detailed considerations such as analisying questions, summarizing \
+relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining \
+any errors, and revisiting previous steps. \
+In the Solution section, based on various attempts, explorations, and reflections from the Thought \
+section, systematically present the final solution that you deem correct. The solution should \
+remain a logical, accurate, concise expression style and detail necessary step needed to reach the \
+conclusion, formatted as follows: \
+<|begin_of_solution|> \
+{final formatted, precise, and clear solution} \
+<|end_of_solution|> \
+Now, try to solve the following question through the above guidelines:"
+
+
+def convert_to_sharegpt_format(row: Dict[str, Any], prompt_column, response_column):
+    prompt = row[prompt_column]
+    # Create the conversation format
+    conversations = [
+        {"from": "user", "value": prompt},
+        {
+            "from": "assistant",
+            "value": row[response_column],
+        },
+    ]
+
+    # Prepare the final structure
+    cur_data = {
+        "system": STILL2_SYSTEM_PROMPT,
+        "conversations": conversations,
+        # TODO: remove this
+        **row,
+    }
+
+    return cur_data
diff --git a/recipes/sky-t1-preview/preprocess.py b/recipes/sky-t1-preview/preprocess.py
@@ -0,0 +1,87 @@
+import json
+
+import pyarrow as pa
+from ray.data import Schema
+
+
+class APPSPreprocessor:
+    WITH_FN_NAME_TEMPLATE = "Generate an executable Python function generated from the given prompt. The function should take stdin as input and print the output. Simply call the function after the definition. {prompt}"  # noqa: E501
+
+    WITHOUT_FN_NAME_TEMPLATE = "Generate an executable Python function generated from the given prompt. Return the function body without invoking it at the final solution. {prompt}"  # noqa: E501
+
+    WITH_STARTER_CODE_TEMPLATE = "{input}\n{starter_code}"
+
+    def __call__(self, row):
+        test_case = json.loads(row["input_output"])
+        starter_code = row["starter_code"]
+        prompt = row["question"]
+        if not test_case.get("fn_name"):
+            _input = self.WITH_FN_NAME_TEMPLATE.format(prompt=prompt)
+        else:
+            _input = self.WITHOUT_FN_NAME_TEMPLATE.format(prompt=prompt)
+
+        if starter_code is not None:
+            _input = self.WITH_STARTER_CODE_TEMPLATE.format(
+                input=_input, starter_code=starter_code
+            )
+
+        return {**row, "user_input": _input}
+
+
+class TACOPreprocessor:
+    INITIAL_TEMPLATE = "\nQUESTION:\n{prompt}"
+    STARTER_CODE_TEMPLATE = "{input}\n{starter_code}"
+    STDIN_TEMPLATE = "{input}\nUse Standard Input format\nANSWER:\n"
+    CALL_TEMPLATE = "{input}\nUse Call-Based format\nANSWER:\n"
+
+    def __call__(self, problem):
+
+        prompt = problem["question"]
+        starter_code = (
+            None if len(problem["starter_code"]) == 0 else problem["starter_code"]
+        )
+        try:
+            input_output = json.loads(problem["input_output"])
+            fn_name = (
+                None if not input_output.get("fn_name") else input_output["fn_name"]
+            )
+        except ValueError:
+            fn_name = None
+
+        _input = self.INITIAL_TEMPLATE.format(prompt=prompt)
+
+        if starter_code:
+            _input = self.STARTER_CODE_TEMPLATE.format(
+                input=_input, starter_code=starter_code
+            )
+        else:
+            _input = self.INITIAL_TEMPLATE.format(prompt=prompt)
+        if (not fn_name) and (not starter_code):
+            _input = self.STDIN_TEMPLATE.format(input=_input)
+        else:
+            _input = self.CALL_TEMPLATE.format(input=_input)
+
+        return {**problem, "user_input": _input}
+
+
+class NUMINAPreprocessor:
+    TEMPLATE = "Return your final response within \\boxed{{}}. {prompt}"
+
+    def __call__(self, row):
+        prompt = row["problem"]
+        _input = self.TEMPLATE.format(prompt=prompt)
+        return {**row, "user_input": _input}
+
+
+def taco_coerce_types(row, schema: Schema):
+    for key, schema_type in zip(schema.names, schema.types):
+        value = pa.array([row[key]])
+        if value.type != schema_type:
+            if schema_type == pa.string():
+                try:
+                    row[key] = str(row[key])
+                except Exception:
+                    row[key] = ""
+            elif schema_type == pa.null():
+                row[key] = None
+    return row
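
`NUMINAPreprocessor.TEMPLATE` in `preprocess.py` relies on `str.format` brace escaping: `{{}}` renders as a literal `{}`, while `{prompt}` is substituted. A minimal sketch of that behaviour, using the template string from the diff and a made-up question as the only assumption:

```python
# Template copied from NUMINAPreprocessor; the question below is an illustrative placeholder.
TEMPLATE = "Return your final response within \\boxed{{}}. {prompt}"

# Doubled braces are emitted as single literal braces; {prompt} is filled in.
rendered = TEMPLATE.format(prompt="What is 1 + 1?")
print(rendered)
```

Without the doubled braces, `str.format` would treat `{}` as a positional field and raise an `IndexError`, since no positional arguments are passed.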