
Commit a399909

Minor fixes after #23 (#58)
Corrects the README for the new CLI. Fixes the APPS task handler and corrects the Math task handler's key to "math".
1 parent a85f0f4 commit a399909

File tree

4 files changed: +26 −11 lines


README.md

Lines changed: 20 additions & 3 deletions
````diff
@@ -35,14 +35,31 @@
 
 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
 - ``/data``: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
-- ``skythought/tools``: Training data curation and evaluation for Sky-T1. To generate our training data, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality.
+- ``skythought/skythought_evals``: Our data generation and evaluation library. To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality.
 - ``skythought/train``: Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Our model training was completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offloading, costing approximately $450 as per Lambda Cloud pricing.
 
 
 # Evaluation
-Following, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
+
+## Usage
+
+First, clone the repository and install the package
+
+```shell
+git clone https://github.com/NovaSky-AI/SkyThought.git
+cd SkyThought
+# installs shown for conda
+conda create -n eval python==3.10
+conda activate eval
+pip install -e .
+```
+
+For running evaluation, please refer to [skythought_evals/README.md](skythought/skythought_evals/README.md).
+
 
 ### Evaluation results
+Following, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
+
 | Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
 |-----------------------|---------------------|--------|-------|------------|
 | Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
@@ -51,7 +68,7 @@ Following, we show our evaluation results for the Sky-T1-32B-Preview model acros
 | LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
 | LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
 | GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
-| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | - |
+| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
 
 #### Results on non-reasoning benchmarks
````

skythought/skythought_evals/README.md

Lines changed: 2 additions & 6 deletions
````diff
@@ -2,12 +2,8 @@
 This document describes the steps to training data curation and evaluation scripts for Sky-T1.
 
 ## Requirements
-First create the environment as follows.
-```shell
-conda create -n eval python==3.10
-conda activate eval
-pip install -r requirements.txt
-```
+
+Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md).
 
 For running OpenAI model, export the OpenAI key.
 ```shell
````

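The "export the OpenAI key" step referenced in the diff above can be sketched as follows. This assumes the conventional `OPENAI_API_KEY` environment variable read by OpenAI clients; the exact variable name is not shown in this diff, so confirm it in the skythought_evals README.

```shell
# Hypothetical sketch: set the OpenAI key before running OpenAI models.
# OPENAI_API_KEY is an assumption (the variable the openai client
# conventionally reads); replace the placeholder with a real key.
export OPENAI_API_KEY="sk-placeholder"
echo "key set: ${OPENAI_API_KEY:+yes}"
```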
skythought/skythought_evals/tasks/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@
     "numina": NUMINATaskHandler,
     "apps": APPSTaskHandler,
     "taco": TACOTaskHandler,
-    "math500": MathTaskHandler,
+    "math": MathTaskHandler,
     "aime": AIMETaskHandler,
     "gpqa_diamond": GPQADiamondTaskHandler,
     "mmlu": MMLUTaskHandler,
```
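The key rename matters because task names passed on the CLI are resolved through this mapping. A minimal sketch of that kind of registry lookup, using hypothetical stand-in names rather than the actual skythought API:

```python
# Hypothetical registry sketch: mirrors the handler mapping in
# skythought_evals/tasks/__init__.py with a stand-in handler class.

class MathTaskHandler:
    """Stand-in for the real handler class."""
    name = "math"

TASK_HANDLER_MAP = {
    "math": MathTaskHandler,  # key was "math500" before this commit
}

def get_handler(task: str):
    # The CLI task argument must match a dict key exactly, which is
    # why renaming "math500" -> "math" changes what the CLI accepts.
    try:
        return TASK_HANDLER_MAP[task]()
    except KeyError:
        raise ValueError(
            f"Unknown task: {task!r}; choose from {sorted(TASK_HANDLER_MAP)}"
        )

handler = get_handler("math")  # resolves after the rename
```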

skythought/skythought_evals/tasks/apps/apps_handler.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -101,7 +101,7 @@ def make_conversations(self, data, system_prompt, model=None):
     def load_and_filter_dataset(
         self, start, end, split=None, subset=None, difficulty=None, args=None
     ):
-        train_data = self.load_dataset(subset=subset, split=split).to_pandas()
+        train_data = self.load_dataset(subset=subset, split=split)
         if difficulty or "difficulty" in self.task_config.preprocess_config:
             difficulty = (
                 self.task_config.preprocess_config["difficulty"]
@@ -110,6 +110,8 @@ def load_and_filter_dataset(
             )
             train_data = train_data.filter(lambda x: x["difficulty"] == difficulty)
 
+        train_data = train_data.to_pandas()
+
         return train_data.iloc[start:end] if end > 0 else train_data.iloc[start:]
 
     def process_remaining_data(self, train_data, results):
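The fix above reorders the conversion: row filtering happens while the data is still a Hugging Face `Dataset` (whose `filter` takes a per-row predicate), and `.to_pandas()` runs only afterwards, since `pandas.DataFrame.filter` selects labels rather than rows. A toy sketch of that ordering, using a hypothetical stand-in class instead of the real `datasets.Dataset`:

```python
# Toy stand-in (NOT the real Hugging Face class) illustrating the
# filter-first, convert-last ordering from the fixed handler.

class ToyDataset:
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Like datasets.Dataset.filter: keeps rows where predicate is True.
        return ToyDataset([r for r in self.rows if predicate(r)])

    def to_records(self):
        # Stand-in for .to_pandas(); returns plain row dicts.
        return list(self.rows)

data = ToyDataset([
    {"problem": "p1", "difficulty": "introductory"},
    {"problem": "p2", "difficulty": "competition"},
    {"problem": "p3", "difficulty": "competition"},
])

# Mirrors the fixed flow: filter rows first, then convert.
filtered = data.filter(lambda x: x["difficulty"] == "competition").to_records()
print(len(filtered))  # 2
```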
