Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ dependencies = [
"pydantic",
"setuptools",
"typer",
"hf_transfer",
]
license = { text = "Apache-2.0" }
readme = "README.md"
Expand Down
11 changes: 7 additions & 4 deletions recipes/sky-t1-preview/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,11 @@ skythought generate --task apps --model Qwen/QwQ-32B-Preview --backend vllm --ba

skythought generate --task taco --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "MEDIUM"}}' --result-dir $SKYT_HOME/data

skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "math"}}' --result-dir $SKYT_HOME/data
skythought generate --task numina_math --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data

skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "amc_aime"}}' --result-dir $SKYT_HOME/data
skythought generate --task numina_amc_aime --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data

skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "olympiads"}}' --result-dir $SKYT_HOME/data --start 0 --end 20000
skythought generate --task numina_olympiads --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data --start 0 --end 20000
```

This will save the results in individual folders in `result-dir`. The directory structure should be as follows:
Expand All @@ -63,10 +63,13 @@ python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt

### Step 3: Reject Sampling on the formatted data (Example Usage with previous script)

For each folder in `result-dir` saved previously (ex: `Qwen_QwQ-32B-Preview_numina_myHash`), obtain the scores with the following command

```shell
skythought score --task apps --path <path_to_run_folder>
```
Similar for other datasets.

This will overwrite the `results.json` files and add a `"correctness"` entry to each model response.

### Convert to ShareGPT format for training
After obtaining multiple converted files, merge them together and convert to the ShareGPT format to perform training. In our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413), where interested readers can download their part of data and simply concatenating to the data obtained above.
Expand Down
62 changes: 59 additions & 3 deletions skythought/evals/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,14 +10,25 @@ For running OpenAI model, export the OpenAI key.
export OPENAI_API_KEY={openai_api_key}
```

## Usage

We provide three commands in the CLI:

- `skythought evaluate` : Evaluate a model on a given task.
- `skythought generate`: Generate model outputs for a pre-configured task.
- `skythought score`: Score saved generations for a given task.

For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).

## Generation and Evaluation

### Benchmark Evaluation

Given below are two examples for evaluation. For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
Given below are two examples for evaluation.

```shell
skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8 --result-dir ./

skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task gpqa_diamond --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8
```

Expand Down Expand Up @@ -85,7 +96,13 @@ Currently we support distill and reject sampling for NUMINA, APPS, and TACO data
#### Example Usage

```shell
skythought generate --model Qwen/QwQ-32B-Preview --task apps --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
skythought generate --model Qwen/QwQ-32B-Preview --task numina_amc_aime --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
```

Once the generations are saved, you can then apply any postprocessing on the results (saved in a `results.json` file in separate run folder) and then run:

```shell
skythought score --task numina_amc_aime --run-dir <path>
```

### Reproducibility Issues
Expand All @@ -97,4 +114,43 @@ We've noticed that it can be hard to reproduce results in reasoning benchmarks.
- vLLM settings: With vLLM, we’ve also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.
- vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.

We recommend to run all evaluation benchmarks at full precision, i.e float32 to avoid this. By default, we run evaluation in `float32`, which can be customized with the `--backend-args` flag for local inference. In full-precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
We recommend to run evaluation benchmarks at full precision, i.e float32 to avoid this. In full-precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.


## Key Concepts

### Tasks

A Task consists of task-specific configuration and implements
- Dataset loading and preprocessing
- Creating of input conversation to the model
- Scoring of model responses

The configuration (`TaskConfig`) contains dataset loading related details such as Hugging Face dataset ID, the particular subset for this benchmark (e.g., ”Challenge” subset for ARC), and a task template, which contains task-specific instructions to be used (Eg: `Return your answer in \boxed{}`). Each configuration is stored in a YAML. For example, you can see the YAML in this [aime24.yaml file](./tasks/aime/aime24.yaml)

Internally, a Task implementation is termed a "TaskHandler", you can see one such implementation [here](./tasks/aime/aime_handler.py).


To add a new task `mytask`:
- First, see if the task can be simply specified as a configuration (One example is [`aime25`](./tasks/aime/aime25.yaml)). If so, you can add a YAML file in the appropriate folder and re-use an existing handler. (All available handlers are specified [here](./tasks/__init__.py)).
- If not, you should create a new `TaskHandler` subclass for this task along with a task configuration YAML (`mytask.yaml`).

### Models

A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. Different reasoning models use their own system prompt, and some perform best when the response is prefilled with special tokens.

We store our pre-configured models as well as a list of system prompt templates [here](./models/model_configs.yaml).

### Backend

The Backend is concerned with how the LLM instance is created and queried. For flexibility, we support
- Local inference with vLLM (basic single node) or Ray+vLLM (more scalable single and multi-node inference)
- Remote inference behind an OpenAI-compatible endpoint.

The Backend also consists of configuration at instantiation (ex; the data type for the model), along with sampling parameters during generation (temperature, max tokens, etc).

During evaluation, the above tie in together and the flow is as follows:
1. Load dataset and create conversations based on the Task and Model specified by the user
2. Generate model responses from the Backend based on the provided sampling parameters
3. Score model responses based on the Task
4. Output final results
15 changes: 11 additions & 4 deletions skythought/evals/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,8 @@

app = typer.Typer(pretty_exceptions_enable=False)

SAMPLING_PARAMS_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"


def get_run_config(
task: str,
Expand Down Expand Up @@ -109,7 +111,10 @@ def parse_common_args(
f"Task {task} not found. Should be one of {TASK_NAMES_TO_YAML.keys()}"
)
task_args_as_dict = parse_multi_args(task_args)
sampling_params_as_dict = parse_multi_args(sampling_params)
user_provided_sampling_params_as_dict = parse_multi_args(sampling_params)
sampling_params_as_dict = parse_multi_args(SAMPLING_PARAMS_DEFAULT)
sampling_params_as_dict.update(user_provided_sampling_params_as_dict)

backend_args_as_dict = parse_multi_args(backend_args)

if n is not None:
Expand Down Expand Up @@ -189,7 +194,7 @@ def evaluate(
backend_args: Annotated[
str,
typer.Option(
help="Backend parameters to use for inference. For open-source models, we perform inference in float32 by default",
help="Backend parameters to use for inference.",
case_sensitive=False,
),
] = "",
Expand All @@ -199,7 +204,7 @@ def evaluate(
help="Sampling parameters to use for inference.",
case_sensitive=False,
),
] = "temperature=0,top_p=1,max_tokens=32768",
] = SAMPLING_PARAMS_DEFAULT,
result_dir: Annotated[
str,
typer.Option(
Expand All @@ -225,7 +230,9 @@ def evaluate(
seed: Annotated[int, typer.Option(help="Random seed.")] = 41,
assistant_prefill: Annotated[
str,
typer.Option(help=r'Assistant prefill for the model response. Ex: "<think>\n"'),
typer.Option(
help=r'Assistant prefill for the model response, overriding any pre-configured assistant prefill for this model. Ex: "<think>\n"'
),
] = None,
as_test: Annotated[
bool, typer.Option(help="Perform a test run on 10 samples of the dataset.")
Expand Down