Improve documentation and default arguments behaviour in CLI (#87)
# What does this PR do?
- Improves documentation for tasks and models, with a note on how things are structured in the repo (for contributors/advanced users).
- Fixed incorrect docstrings in the CLI
- Added `hf_transfer` to the dependencies, since we default to downloading with `hf_transfer` (it should be safe to use in almost all cases, and can download much faster)
- Fixed some stale information in the recipes.
- Fixed default-argument behaviour for `sampling_params`: user-provided args now `update` our default settings. Previously, we completely disregarded our defaults for temperature, etc. if the user provided any argument.
This will overwrite the `results.json` files and add a `"correctness"` entry to each model response.
### Convert to ShareGPT format for training
After obtaining multiple converted files, merge them together and convert them to the ShareGPT format for training. For our preview model, we also add the science and riddle portions from the [STILL-2 model](https://arxiv.org/pdf/2412.09413); interested readers can download their portion of the data and simply concatenate it with the data obtained above.
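The merge-and-convert step can be sketched as below. The input field names (`"prompt"`/`"response"`) are assumptions about the converted files, not the repo's actual keys; the ShareGPT side uses the standard `conversations` list of `{"from", "value"}` turns:

```python
import json

def to_sharegpt(examples):
    """Convert (prompt, response) records to the ShareGPT conversation schema."""
    return [
        {
            "conversations": [
                {"from": "human", "value": ex["prompt"]},
                {"from": "gpt", "value": ex["response"]},
            ]
        }
        for ex in examples
    ]

def merge_and_save(files, out_path):
    """Merge several converted JSON files into one ShareGPT-format file."""
    merged = []
    for path in files:
        with open(path) as f:
            merged.extend(to_sharegpt(json.load(f)))
    with open(out_path, "w") as f:
        json.dump(merged, f, indent=2)
```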
Once the generations are saved, you can apply any postprocessing to the results (saved in a `results.json` file in a separate run folder) and then run:
We've noticed that it can be hard to reproduce results in reasoning benchmarks.
- vLLM settings: With vLLM, we’ve also noticed that at half-precision, different batch sizes can affect downstream evaluation results by a few percentage points. Further, different tensor parallelism settings can also change results in half-precision.
- vLLM version: Different versions of vLLM will use different CUDA-Toolkit or Flash attention versions. Even for the same settings, these differences in the underlying kernels used can change results.
We recommend running evaluation benchmarks at full precision, i.e., `float32`, to avoid this. In full precision, evaluation results should be robust to changes in batch size, tensor parallel size, version differences, etc.
## Key Concepts
### Tasks
A Task consists of task-specific configuration and implements
- Dataset loading and preprocessing
- Creation of the input conversation to the model
- Scoring of model responses
The configuration (`TaskConfig`) contains dataset-loading details, such as the Hugging Face dataset ID, the particular subset for the benchmark (e.g., the "Challenge" subset for ARC), and a task template with task-specific instructions (e.g., `Return your answer in \boxed{}`). Each configuration is stored in a YAML file; for example, see the [aime24.yaml file](./tasks/aime/aime24.yaml).
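A hypothetical configuration in this spirit might look as follows; the field names here are illustrative only and do not come from the actual `TaskConfig` schema (consult the real YAML files in the repo for the correct layout):

```yaml
# Illustrative sketch only — field names are assumptions, not the real schema.
handler: math                              # re-use an existing handler
dataset_path: my-org/my-math-benchmark     # hypothetical Hugging Face dataset ID
dataset_subset: null                       # e.g., "Challenge" for ARC
templating_parameters:
  template: "Return your answer in \\boxed{}. {prompt}"
```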
Internally, a Task implementation is termed a "TaskHandler"; you can see one such implementation [here](./tasks/aime/aime_handler.py).
To add a new task `mytask`:
- First, see whether the task can simply be specified as a configuration (one example is [`aime25`](./tasks/aime/aime25.yaml)). If so, you can add a YAML file in the appropriate folder and re-use an existing handler. (All available handlers are listed [here](./tasks/__init__.py).)
- If not, you should create a new `TaskHandler` subclass for this task along with a task configuration YAML (`mytask.yaml`).
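A purely illustrative sketch of such a subclass is below. The real `TaskHandler` base class and its method names live in the repo's `tasks/` folder and will differ; every name here is hypothetical:

```python
class TaskHandler:
    """Stand-in for the repo's actual base class (hypothetical)."""
    def __init__(self, task_config):
        self.task_config = task_config

class MyTaskHandler(TaskHandler):
    """Hypothetical handler for `mytask`, paired with mytask.yaml."""

    def load_and_filter_dataset(self):
        # Load the Hugging Face dataset named in the config, apply any
        # benchmark-specific filtering, and return the examples.
        ...

    def generate_prompt(self, problem: str) -> str:
        # Apply the task template (here hard-coded for illustration).
        return f"Return your answer in \\boxed{{}}. {problem}"

    def check_correctness(self, problem, response) -> bool:
        # Score one model response against the reference answer.
        ...
```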
### Models
A Model consists of the model ID and templating configuration. This configuration optionally contains the system prompt and an assistant prefill message. Different reasoning models use their own system prompt, and some perform best when the response is prefilled with special tokens.
We store our pre-configured models as well as a list of system prompt templates [here](./models/model_configs.yaml).
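For intuition, a model entry in this style might look like the sketch below; the actual schema in `model_configs.yaml` may differ, and the model ID and field names here are invented:

```yaml
# Hypothetical entry — not the real model_configs.yaml schema.
my-org/my-reasoning-model:
  system_prompt: "You are a helpful assistant that reasons step by step."
  assistant_prefill: "<think>"   # prefill tokens some reasoning models expect
```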
### Backend
The Backend is concerned with how the LLM instance is created and queried. For flexibility, we support:
- Local inference with vLLM (basic single node) or Ray+vLLM (more scalable single and multi-node inference)
- Remote inference behind an OpenAI-compatible endpoint.
The Backend also has configuration at instantiation (e.g., the data type for the model), along with sampling parameters used during generation (temperature, max tokens, etc.).
During evaluation, the above components tie together, and the flow is as follows:
1. Load dataset and create conversations based on the Task and Model specified by the user
2. Generate model responses from the Backend based on the provided sampling parameters