Commit 2ff6858

Improve documentation and default arguments behaviour in CLI (#87)
# What does this PR do?

- Improves documentation for tasks and models, with a note on how things are structured in the repo (for contributors/advanced users).
- Fixes incorrect docstrings in the CLI.
- Adds `hf_transfer` as a dependency, since we default to downloading with `hf_transfer` (it should be safe to use in almost all cases, and downloads much faster).
- Fixes stale information in the recipes.
- Fixes the default-arguments behaviour for `sampling_params`: user-provided arguments now `update` our default settings. Previously, we completely disregarded our defaults for temperature, etc. if the user provided any argument.
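The sampling-params fix can be sketched in a few lines of Python. Note that `parse_multi_args` below is our own minimal stand-in for the CLI's `key=value` parser, not the repo's implementation:

```python
# Sketch of the fixed behaviour: defaults are parsed first, then updated
# with user-provided values, so any default the user did not override survives.
SAMPLING_PARAMS_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"

def parse_multi_args(s: str) -> dict:
    # Minimal stand-in for the CLI's "key=value,key=value" parser.
    out = {}
    for pair in filter(None, (p.strip() for p in s.split(","))):
        key, value = pair.split("=", 1)
        try:
            out[key] = int(value)
        except ValueError:
            try:
                out[key] = float(value)
            except ValueError:
                out[key] = value
    return out

def merged_sampling_params(user_args: str) -> dict:
    params = parse_multi_args(SAMPLING_PARAMS_DEFAULT)
    params.update(parse_multi_args(user_args))  # user-provided values win
    return params
```

With `user_args="temperature=0.6"`, the result keeps `top_p=1` and `max_tokens=32768` from the defaults; before this fix, those defaults would have been dropped entirely.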
1 parent 9a7b4b3 commit 2ff6858

File tree

4 files changed: +78 -11 lines changed


pyproject.toml

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@ dependencies = [
     "pydantic",
     "setuptools",
     "typer",
+    "hf_transfer",
 ]
 license = { text = "Apache-2.0" }
 readme = "README.md"
```

recipes/sky-t1-preview/README.md

Lines changed: 7 additions & 4 deletions
````diff
@@ -34,11 +34,11 @@ skythought generate --task apps --model Qwen/QwQ-32B-Preview --backend vllm --ba
 
 skythought generate --task taco --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "MEDIUM"}}' --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "math"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_math --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "amc_aime"}}' --result-dir $SKYT_HOME/data
+skythought generate --task numina_amc_aime --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
 
-skythought generate --task numina --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --task-args '{"dataset_split": "train", "preprocess_config": {"difficulty": "olympiads"}}' --result-dir $SKYT_HOME/data --start 0 --end 20000
+skythought generate --task numina_olympiads --model Qwen/QwQ-32B-Preview --backend vllm --backend-args tp=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data --start 0 --end 20000
 ```
 
 This will save the results in individual folders in `result-dir`. The directory structure should be as follows:
@@ -63,10 +63,13 @@ python scripts/convert_format.py --input_dir $SKYT_HOME/data --keys keys.txt
 
 ### Step 3: Reject Sampling on the formatted data (Example Usage with previous script)
 
+For each folder in `result-dir` saved previously (e.g., `Qwen_QwQ-32B-Preview_numina_myHash`), obtain the scores with the following command:
+
 ```shell
 skythought score --task apps --path <path_to_run_folder>
 ```
-Similar for other datasets.
+
+This will overwrite the `results.json` files and add a `"correctness"` entry to each model response.
 
 ### Convert to ShareGPT format for training
 After obtaining multiple converted files, merge them together and convert to the ShareGPT format to perform training. For our preview model, we also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download their portion of the data and simply concatenate it to the data obtained above.
````
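Scoring every run folder in turn can be scripted. Below is a sketch under assumptions: the `score_commands` helper is ours, not part of the CLI, and it only assumes each run lives in its own subdirectory of `result-dir` as described above:

```python
from pathlib import Path

def score_commands(result_dir: str, task: str) -> list[str]:
    # Build one `skythought score` invocation per run folder in result-dir.
    # Assumes each run was saved in its own subdirectory (see the layout above).
    return [
        f"skythought score --task {task} --path {folder}"
        for folder in sorted(p for p in Path(result_dir).iterdir() if p.is_dir())
    ]
```

Each emitted command, when run, overwrites that folder's `results.json` with per-response correctness, per the step above.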

skythought/evals/README.md

Lines changed: 59 additions & 3 deletions
````diff
@@ -10,14 +10,25 @@ For running OpenAI model, export the OpenAI key.
 export OPENAI_API_KEY={openai_api_key}
 ```
 
+## Usage
+
+We provide three commands in the CLI:
+
+- `skythought evaluate`: Evaluate a model on a given task.
+- `skythought generate`: Generate model outputs for a pre-configured task.
+- `skythought score`: Score saved generations for a given task.
+
+For a walkthrough of the basics, please refer to the [example](../../examples/evaluate.ipynb).
+
 ## Generation and Evaluation
 
 ### Benchmark Evaluation
 
-Given below are two examples for evaluation. For a walkthrough on the basics, please refer to the [example](../../examples/evaluate.ipynb).
+Given below are two examples for evaluation.
 
 ```shell
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task aime --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8 --result-dir ./
+
 skythought evaluate --model NovaSky-AI/Sky-T1-32B-Preview --task gpqa_diamond --backend vllm --backend-args tensor_parallel_size=8 --sampling-params temperature=0.6,top_p=0.95 --n 8
 ```
 
@@ -85,7 +96,13 @@ Currently we support distill and reject sampling for NUMINA, APPS, and TACO data
 #### Example Usage
 
 ```shell
-skythought generate --model Qwen/QwQ-32B-Preview --task apps --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+skythought generate --model Qwen/QwQ-32B-Preview --task numina_amc_aime --backend ray --backend-args tensor_parallel_size=8 --sampling-params max_tokens=16384 --result-dir $SKYT_HOME/data
+```
+
+Once the generations are saved, you can apply any postprocessing to the results (saved in a `results.json` file in a separate run folder) and then run:
+
+```shell
+skythought score --task numina_amc_aime --run-dir <path>
 ```
 
 ### Reproducibility Issues
@@ -97,4 +114,43 @@ We've noticed that it can be hard to reproduce results in reasoning benchmarks.
 - vLLM settings: With vLLM, we've also noticed that at half precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half precision.
 - vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash Attention versions. Even for the same settings, these differences in the underlying kernels used can change results.
 
-We recommend to run all evaluation benchmarks at full precision, i.e float32 to avoid this. By default, we run evaluation in `float32`, which can be customized with the `--backend-args` flag for local inference. In full-precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
+We recommend running evaluation benchmarks at full precision (i.e., float32) to avoid this. At full precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
+
+## Key Concepts
+
+### Tasks
+
+A Task consists of task-specific configuration and implements
+- Dataset loading and preprocessing
+- Creating the input conversation for the model
+- Scoring model responses
+
+The configuration (`TaskConfig`) contains dataset-loading details such as the Hugging Face dataset ID, the particular subset for the benchmark (e.g., the "Challenge" subset for ARC), and a task template containing task-specific instructions (e.g., `Return your answer in \boxed{}`). Each configuration is stored in a YAML file; for an example, see [aime24.yaml](./tasks/aime/aime24.yaml).
+
+Internally, a Task implementation is termed a "TaskHandler"; you can see one such implementation [here](./tasks/aime/aime_handler.py).
+
+To add a new task `mytask`:
+- First, see if the task can be specified purely as configuration (one example is [`aime25`](./tasks/aime/aime25.yaml)). If so, you can add a YAML file in the appropriate folder and re-use an existing handler. (All available handlers are listed [here](./tasks/__init__.py).)
+- If not, create a new `TaskHandler` subclass for the task, along with a task configuration YAML (`mytask.yaml`).
+
+### Models
+
+A Model consists of the model ID and templating configuration, which optionally contains a system prompt and an assistant prefill message. Different reasoning models use their own system prompts, and some perform best when the response is prefilled with special tokens.
+
+We store our pre-configured models, as well as a list of system prompt templates, [here](./models/model_configs.yaml).
+
+### Backend
+
+The Backend is concerned with how the LLM instance is created and queried. For flexibility, we support
+- Local inference with vLLM (basic single-node) or Ray+vLLM (more scalable single- and multi-node inference)
+- Remote inference behind an OpenAI-compatible endpoint
+
+The Backend also takes configuration at instantiation (e.g., the data type for the model), along with sampling parameters during generation (temperature, max tokens, etc.).
+
+During evaluation, these components tie together; the flow is as follows:
+1. Load the dataset and create conversations based on the Task and Model specified by the user
+2. Generate model responses from the Backend based on the provided sampling parameters
+3. Score model responses based on the Task
+4. Output final results
````
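The four-step flow above can be sketched with hypothetical stand-ins. None of the class names or signatures below are the repo's actual API; they only illustrate how Task, Model, and Backend compose:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    # Hypothetical: loads questions and scores a (question, response) pair.
    load: Callable[[], List[str]]
    score: Callable[[str, str], bool]

@dataclass
class Model:
    # Hypothetical templating config: system prompt plus optional prefill.
    system_prompt: str
    assistant_prefill: str = ""

    def to_conversation(self, question: str) -> List[Dict[str, str]]:
        conv = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": question},
        ]
        if self.assistant_prefill:
            conv.append({"role": "assistant", "content": self.assistant_prefill})
        return conv

@dataclass
class Backend:
    # Hypothetical: wraps however the LLM is queried (vLLM, Ray+vLLM, OpenAI API).
    generate: Callable[[List[Dict[str, str]]], str]

def evaluate(task: Task, model: Model, backend: Backend) -> float:
    questions = task.load()                                 # 1. load dataset
    convs = [model.to_conversation(q) for q in questions]   #    and build conversations
    responses = [backend.generate(c) for c in convs]        # 2. generate responses
    scores = [task.score(q, r) for q, r in zip(questions, responses)]  # 3. score
    return sum(scores) / len(scores)                        # 4. final result
```

The point of the sketch is the separation of concerns: the Task never touches the LLM, the Backend never touches scoring, and the Model only shapes the conversation.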

skythought/evals/cli.py

Lines changed: 11 additions & 4 deletions
```diff
@@ -29,6 +29,8 @@
 
 app = typer.Typer(pretty_exceptions_enable=False)
 
+SAMPLING_PARAMS_DEFAULT = "temperature=0,top_p=1,max_tokens=32768"
+
 
 def get_run_config(
     task: str,
@@ -109,7 +111,10 @@ def parse_common_args(
             f"Task {task} not found. Should be one of {TASK_NAMES_TO_YAML.keys()}"
         )
     task_args_as_dict = parse_multi_args(task_args)
-    sampling_params_as_dict = parse_multi_args(sampling_params)
+    user_provided_sampling_params_as_dict = parse_multi_args(sampling_params)
+    sampling_params_as_dict = parse_multi_args(SAMPLING_PARAMS_DEFAULT)
+    sampling_params_as_dict.update(user_provided_sampling_params_as_dict)
+
     backend_args_as_dict = parse_multi_args(backend_args)
 
     if n is not None:
@@ -189,7 +194,7 @@ def evaluate(
     backend_args: Annotated[
         str,
         typer.Option(
-            help="Backend parameters to use for inference. For open-source models, we perform inference in float32 by default",
+            help="Backend parameters to use for inference.",
             case_sensitive=False,
         ),
     ] = "",
@@ -199,7 +204,7 @@ def evaluate(
             help="Sampling parameters to use for inference.",
             case_sensitive=False,
         ),
-    ] = "temperature=0,top_p=1,max_tokens=32768",
+    ] = SAMPLING_PARAMS_DEFAULT,
     result_dir: Annotated[
         str,
         typer.Option(
@@ -225,7 +230,9 @@ def evaluate(
     seed: Annotated[int, typer.Option(help="Random seed.")] = 41,
     assistant_prefill: Annotated[
         str,
-        typer.Option(help=r'Assistant prefill for the model response. Ex: "<think>\n"'),
+        typer.Option(
+            help=r'Assistant prefill for the model response, overriding any pre-configured assistant prefill for this model. Ex: "<think>\n"'
+        ),
     ] = None,
     as_test: Annotated[
         bool, typer.Option(help="Perform a test run on 10 samples of the dataset.")
```
