Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,12 @@ repos:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
# NOTE (sumanthrh): Many of the files excluded here are used for validating code generation, and linters do not recognize some of the logic in these files. skythought/train is excluded for now because it's a fork of Llamafactory
exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|tasks/taco/taco_util\.py|tasks/apps/apps_util\.py|scripts/prompts\.py|skythought/test-time-scaling/.*)$
exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|pyext2\.py|taco_util\.py|apps_util\.py|scripts/prompts\.py|skythought/test-time-scaling/.*)$


# Black needs to be ran after ruff with --fix
- repo: https://github.com/psf/black
rev: 24.10.0
hooks:
- id: black
exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|tasks/taco/pyext2\.py|skythought/test-time-scaling/.*)$
exclude: (^skythought/train/.*|^skythought/skythought-rl/.*|pyext2\.py|skythought/test-time-scaling/.*)$
4 changes: 2 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@

We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
- [`recipes`](./recipes/): Recipes - data curation steps and training strategies - for building our models `Sky-T1-32B-Flash`, `Sky-T1-32B-Preview` and `Sky-T1-7B` series.
- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library.
- [`skythought/evals`](./skythought/evals/): Our data generation and evaluation library. We provide a convenient CLI for evaluation as well as a `Scorer` API for scoring during data curation and training ([example](./examples/scoring.ipynb)).
- [`skythought/train`](./skythought/train/): Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training.
- [`skythought/skythought-rl`](./skythought/skythought-rl/): RL training code for Sky-T1-7B and Sky-T1-mini.

Expand Down Expand Up @@ -125,7 +125,7 @@ We also evaluate on non-reasoning benchmarks (these are benchmarks for instructi

For more details, refer [here](./skythought/evals/base_instruct_evals.md).

## Fully Open-source: Driving Progress Together
# Fully Open-source: Driving Progress Together
We believe that open-source collaboration drives progress, and with Sky-T1-32B-Preview, we are fully committed to empowering the community. We open-source all details (i.e., data, codes, model weights) to enable the community to replicate and improve on our results *easily*:

<table>
Expand Down
204 changes: 204 additions & 0 deletions examples/scoring.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Skythought Scoring: Unified APIs for data curation, training and evaluation\n",
"\n",
"This notebook will provide a quick overview of the `Scorer` API in Skythought. A `Scorer` is a lightweight class that deals with scoring model response for a given task. Skythought provides a set of pre-defined scoring functions for verifiable domains (math, coding, etc), making it easy to use consistent scoring across curation, training and evaluation. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installation and Setup\n",
"\n",
"First, make sure you've installed the latest changes from source:\n",
"\n",
"#### Installing from source\n",
"\n",
"\n",
"```shell\n",
"# Clone the repository\n",
"git clone https://github.com/NovaSky-AI/SkyThought.git\n",
"cd SkyThought\n",
"\n",
"# Create and activate a virtual environment (using uv here)\n",
"uv venv --python 3.10\n",
"source .venv/bin/activate\n",
"\n",
"# Install the package in editable mode\n",
"uv pip install -e .\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Usage during Data Curation\n",
"\n",
"Here's an example recipe for data curation:\n",
"\n",
"1. Create a dataset combining the “hard’ subset of NUMINA and the GSM8K dataset . \n",
"2. Perform rejection sampling with the base model. \n",
" a. Obtain a response for each sample and filter out the incorrect responses. \n",
" b. For scoring, we will combine two functions: a correctness check for math responses like math verify along with a format scorer to make sure the model is adhering to instructions. \n",
"\n",
"\n",
"```python\n",
"import ray\n",
"from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig\n",
"from datasets import load_dataset\n",
"from skythought.evals.scoring import Scorer, MathEqualScorer\n",
"import re\n",
"import os \n",
"\n",
"SYSTEM_PROMPT = \"Think step-by-step and provide the final answer in \\\\boxed{}\"\n",
"MAX_TOKENS = 2048 \n",
"\n",
"class FormatScorer(Scorer):\n",
" SCORE_COLUMN = \"format_score\"\n",
" def __init__(self, response_column):\n",
" self.response_column = response_column\n",
"\n",
" def score(self, row):\n",
" pat1 = \"<think>(.*)</think>\"\n",
" pat2 = \"\\\\boxed{(.*)}\"\n",
" text = row[self.response_column]\n",
" match1 = re.search(pat1, text)\n",
" match2 = re.search(pat2, text)\n",
" # if even one of the patterns is not found, return 0\n",
" if not match1 or not match2:\n",
" passed = False\n",
" passed = True\n",
" return {self.SCORE_COLUMN: passed}\n",
"\n",
"\n",
"if __name__ == \"__main__\":\n",
"\n",
" # limit the number of samples per dataset for testing\n",
" num_samples = 20\n",
"\n",
" save_dir = \"my_results_dir\"\n",
" \n",
" numina_hf = load_dataset(\"AI-MO/NuminaMath-CoT\", split=\"train\")\n",
" gsm8k_hf = load_dataset(\"openai/gsm8k\", \"main\", split=\"train\")\n",
" \n",
" # filter hard problems and rename to match GSM8K's format\n",
" ds1 = ray.data.from_huggingface(numina_hf) \\\n",
" .filter(expr=\"source == 'hard'\")\\\n",
" .rename_columns({\"problem\": \"question\", \"solution\": \"answer\"}) \\\n",
" .drop_columns([\"source\"]).limit(num_samples)\n",
"\n",
" ds2 = ray.data.from_huggingface(gsm8k_hf).limit(num_samples)\n",
"\n",
" ds = ds1.union(ds2)\n",
"\n",
" llm = build_llm_processor(\n",
" vLLMEngineProcessorConfig(\n",
" model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" engine_kwargs=dict(\n",
" tensor_parallel_size=2\n",
" ),\n",
" batch_size=64,\n",
" concurrency=2,\n",
" ),\n",
" preprocess=lambda row: dict(\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
" {\"role\": \"user\", \"content\": row[\"question\"]},\n",
" ],\n",
" sampling_params=dict(\n",
" temperature=0,\n",
" max_tokens=MAX_TOKENS,\n",
" ),\n",
" )\n",
" )\n",
" # generates responses and saves it in \"generated_text\" column\n",
" ds = llm(ds)\n",
"\n",
" ds = ds.map(\n",
" MathEqualScorer, \n",
"\t fn_constructor_kwargs= dict(\n",
" response_column=\"generated_text\", answer_column=\"answer\"\n",
" ),\n",
" concurrency=5\n",
" )\n",
"\n",
" ds = ds.map(\n",
" FormatScorer, \n",
" fn_constructor_kwargs= dict(\n",
" response_column=\"generated_text\"\n",
" ),\n",
" concurrency=5\n",
" )\n",
"\n",
" ds = ds.filter(expr=\"math_equal_score and format_score\")\n",
" \n",
" ds.write_parquet(os.path.abspath(save_dir))\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example Usage During Training\n",
"\n",
"Given below is an example of creating a custom scorer for training for the dataset used in TULU-3's RLVR stage (a mix of GSM8K, IFEval and MATH)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"...\n",
"from skythought.scoring import MathVerifyScorer, GSM8KScorer, IFEvalScorer, Scorer\n",
"\n",
"# Custom Scoring function for a mix of GSM8K, MATH and IFEval \n",
"class MyScorer(Scorer):\n",
"\tSCORE_COLUMN = \"score\"\n",
"\tdef __init__(self, source_column, response_column, output_column):\n",
"\t\tself.source_column = source_column\n",
"\t\tself.response_column = response_column\n",
"\t\tself.output_column = output_column\n",
"\t\tself.gsm8k = GSM8KScorer(response_column, output_column)\n",
"\t\tself.ifeval = IFEvalScorer(response_column, output_column)\n",
"\t\tself.math = MathVerifyScorer(response_column, output_column)\n",
"\n",
"\tdef score(self, row):\n",
"\t\tsource = row[self.source_column]\n",
"\t\tif source == \"gsm8k\": \n",
"\t\t\treturn {self.SCORE_COLUMN: self.gsm8k(row)}\n",
"\t\telif source == \"math\": \n",
"\t\t\treturn {self.SCORE_COLUMN: self.math(row)}\n",
"\t\telif source == \"ifeval\":\n",
"\t\t\treturn {self.SCORE_COLUMN: self.ifeval(row)}\n",
"\t\telse:\n",
"\t\t\traise ValueError\n",
"\n",
"def main(args):\n",
" dataset_args, training_args = parse_args(args)\n",
" ...\n",
" train_dataset = prepare_dataset(train_dataset, tokenizer)\n",
" eval_dataset = prepare_dataset(eval_dataset, tokenizer)\n",
" # assume that the trainer will provide inputs as a single dict. if not, you can customize the interface for the scorer\n",
"\t# you can use `.score` or the __call__ interface to get the scores\n",
" reward_function = MyScorer(\"id\", \"response\", \"ground_truth\")\n",
"```"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
3 changes: 1 addition & 2 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@ authors = [
requires-python = ">=3.9,<3.11"
dependencies = [
"vllm==0.7.0",
"pyext",
"word2number",
"scipy",
"datasets",
Expand Down Expand Up @@ -45,7 +44,7 @@ skythought = ["evals/**/*.yaml", "evals/**/*.yml"]
skythought = "skythought.evals.cli:main"

[project.optional-dependencies]
dev = ["pytest", "pytest-mock", "black", "ruff", "pre-commit"]
dev = ["pytest", "pytest-mock", "pytest-asyncio", "black", "ruff", "pre-commit"]

[tool.ruff]
line-length = 160
Expand Down
Empty file.
43 changes: 43 additions & 0 deletions recipes/sky-t1-preview/postprocess.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
from typing import Any, Dict

STILL2_SYSTEM_PROMPT = "Your role as an assistant involves thoroughly exploring questions through a systematic long \
thinking process before providing the final precise and accurate solutions. This requires \
engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, \
backtracing, and iteration to develop well-considered thinking process. \
Please structure your response into two main sections: Thought and Solution. \
In the Thought section, detail your reasoning process using the specified format: \
<|begin_of_thought|> {thought with steps separated with '\n\n'} \
<|end_of_thought|> \
Each step should include detailed considerations such as analisying questions, summarizing \
relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining \
any errors, and revisiting previous steps. \
In the Solution section, based on various attempts, explorations, and reflections from the Thought \
section, systematically present the final solution that you deem correct. The solution should \
remain a logical, accurate, concise expression style and detail necessary step needed to reach the \
conclusion, formatted as follows: \
<|begin_of_solution|> \
{final formatted, precise, and clear solution} \
<|end_of_solution|> \
Now, try to solve the following question through the above guidelines:"


def convert_to_sharegpt_format(row: Dict[str, Any], prompt_column, response_column):
prompt = row[prompt_column]
# Create the conversation format
conversations = [
{"from": "user", "value": prompt},
{
"from": "assistant",
"value": row[response_column],
},
]

# Prepare the final structure
cur_data = {
"system": STILL2_SYSTEM_PROMPT,
"conversations": conversations,
# TODO: remove this
**row,
}

return cur_data
87 changes: 87 additions & 0 deletions recipes/sky-t1-preview/preprocess.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
import json

import pyarrow as pa
from ray.data import Schema


class APPSPreprocessor:
WITH_FN_NAME_TEMPLATE = "Generate an executable Python function generated from the given prompt. The function should take stdin as input and print the output. Simply call the function after the definition. {prompt}" # noqa: E501

WITHOUT_FN_NAME_TEMPLATE = "Generate an executable Python function generated from the given prompt. Return the function body without invoking it at the final solution. {prompt}" # noqa: E501

WITH_STARTER_CODE_TEMPLATE = "{input}\n{starter_code}"

def __call__(self, row):
test_case = json.loads(row["input_output"])
starter_code = row["starter_code"]
prompt = row["question"]
if not test_case.get("fn_name"):
_input = self.WITH_FN_NAME_TEMPLATE.format(prompt=prompt)
else:
_input = self.WITHOUT_FN_NAME_TEMPLATE.format(prompt=prompt)

if starter_code is not None:
_input = self.WITH_STARTER_CODE_TEMPLATE.format(
input=_input, starter_code=starter_code
)

return {**row, "user_input": _input}


class TACOPreprocessor:
INITIAL_TEMPLATE = "\nQUESTION:\n{prompt}"
STARTER_CODE_TEMPLATE = "{input}\n{starter_code}"
STDIN_TEMPLATE = "{input}\nUse Standard Input format\nANSWER:\n"
CALL_TEMPLATE = "{input}\nUse Call-Based format\nANSWER:\n"

def __call__(self, problem):

prompt = problem["question"]
starter_code = (
None if len(problem["starter_code"]) == 0 else problem["starter_code"]
)
try:
input_outpout = json.loads(problem["input_output"])
fn_name = (
None if not input_outpout.get("fn_name") else input_outpout["fn_name"]
)
except ValueError:
fn_name = None

_input = self.INITIAL_TEMPLATE.format(prompt=prompt)

if starter_code:
_input = self.STARTER_CODE_TEMPLATE.format(
input=_input, starter_code=starter_code
)
else:
_input = self.INITIAL_TEMPLATE.format(prompt=prompt)
if (not fn_name) and (not starter_code):
_input = self.STDIN_TEMPLATE.format(input=_input)
else:
_input = self.CALL_TEMPLATE.format(input=_input)

return {**problem, "user_input": _input}


class NUMINAPreprocessor:
TEMPLATE = "Return your final response within \\boxed{{}}. {prompt}"

def __call__(self, row):
prompt = row["problem"]
_input = self.TEMPLATE.format(prompt=prompt)
return {**row, "user_input": _input}


def taco_coerce_types(row, schema: Schema):
for key, schema_type in zip(schema.names, schema.types):
value = pa.array([row[key]])
if value.type != schema_type:
if schema_type == pa.string():
try:
row[key] = str(row[key])
except Exception:
row[key] = ""
elif schema_type == pa.null():
row[key] = None
return row
Loading
Loading