
Commit a399909

Minor fixes after #23 (#58)
Corrects the README for the new CLI. Fixes the APPS task handler and corrects the Math task handler's key to "math".
1 parent a85f0f4 commit a399909

File tree

4 files changed: +26 −11 lines


README.md

Lines changed: 20 additions & 3 deletions
````diff
@@ -35,14 +35,31 @@
 
 We open source the code and scripts we used for data curation, training, and evaluation for Sky-T1-32B-Preview, you can find more details in each directory.
 - ``/data``: The 17k training data used to train Sky-T1-32B-Preview. We also add the science and riddle portion from the [STILL-2 model](https://arxiv.org/pdf/2412.09413).
-- ``skythought/tools``: Training data curation and evaluation for Sky-T1. To generate our training data, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality.
+- ``skythought/skythought_evals``: Our data generation and evaluation library. To generate the training data for Sky-T1, we use the QwQ-32B-Preview model. We curate the data mixture to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality.
 - ``skythought/train``: Training scripts for Sky-T1. We use [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to perform training. The model was trained for 3 epochs with a learning rate of 1e-5 and a batch size of 96. Our model training was completed in 19 hours on 8 H100 GPUs using DeepSpeed Zero-3 offloading, costing approximately $450 as per Lambda Cloud pricing.
 
 
 # Evaluation
-Following, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
+
+## Usage
+
+First, clone the repository and install the package
+
+```shell
+git clone https://github.com/NovaSky-AI/SkyThought.git
+cd SkyThought
+# installs shown for conda
+conda create -n eval python==3.10
+conda activate eval
+pip install -e .
+```
+
+For running evaluation, please refer to [skythought_evals/README.md](skythought/skythought_evals/README.md).
+
 
 ### Evaluation results
+Following, we show our evaluation results for the Sky-T1-32B-Preview model across math, coding, and science benchmarks.
+
 | Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ | o1-preview |
 |-----------------------|---------------------|--------|-------|------------|
 | Math500 | 86.4 | 81.4 | 92.2 | 81.4 |
@@ -51,7 +68,7 @@ Following, we show our evaluation results for the Sky-T1-32B-Preview model acros
 | LiveCodeBench-Medium | 56.8 | 40.8 | 56.3 | 54.9 |
 | LiveCodeBench-Hard | 17.9 | 9.8 | 17.1 | 16.3 |
 | GPQA-Diamond | 56.8 | 45.5 | 52.5 | 75.2 |
-| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | - |
+| OlympiadBench (Math, EN) | 59.79 | 46.74 | 62.17 | 59.2 |
 
 #### Results on non-reasoning benchmarks
````

skythought/skythought_evals/README.md

Lines changed: 2 additions & 6 deletions
````diff
@@ -2,12 +2,8 @@
 This document describes the steps to training data curation and evaluation scripts for Sky-T1.
 
 ## Requirements
-First create the environment as follows.
-```shell
-conda create -n eval python==3.10
-conda activate eval
-pip install -r requirements.txt
-```
+
+Make sure you have installed the `skythought-evals` package as outlined in the [README.md](../README.md).
 
 For running OpenAI model, export the OpenAI key.
 ```shell
````

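The "export the OpenAI key" step referenced in the diff above can be sketched as follows. This assumes the conventional `OPENAI_API_KEY` environment variable read by OpenAI clients; the exact variable name is not shown in this diff, so confirm it in the skythought_evals README.

```shell
# Hypothetical sketch: set the OpenAI key before running OpenAI models.
# OPENAI_API_KEY is an assumption (the variable the openai client
# conventionally reads); replace the placeholder with a real key.
export OPENAI_API_KEY="sk-placeholder"
echo "key set: ${OPENAI_API_KEY:+yes}"
```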
skythought/skythought_evals/tasks/__init__.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -17,7 +17,7 @@
     "numina": NUMINATaskHandler,
     "apps": APPSTaskHandler,
     "taco": TACOTaskHandler,
-    "math500": MathTaskHandler,
+    "math": MathTaskHandler,
     "aime": AIMETaskHandler,
     "gpqa_diamond": GPQADiamondTaskHandler,
     "mmlu": MMLUTaskHandler,
```
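The key rename matters because task names passed on the CLI are resolved through this mapping. A minimal sketch of that kind of registry lookup, using hypothetical stand-in names rather than the actual skythought API:

```python
# Hypothetical registry sketch: mirrors the handler mapping in
# skythought_evals/tasks/__init__.py with a stand-in handler class.

class MathTaskHandler:
    """Stand-in for the real handler class."""
    name = "math"

TASK_HANDLER_MAP = {
    "math": MathTaskHandler,  # key was "math500" before this commit
}

def get_handler(task: str):
    # The CLI task argument must match a dict key exactly, which is
    # why renaming "math500" -> "math" changes what the CLI accepts.
    try:
        return TASK_HANDLER_MAP[task]()
    except KeyError:
        raise ValueError(
            f"Unknown task: {task!r}; choose from {sorted(TASK_HANDLER_MAP)}"
        )

handler = get_handler("math")  # resolves after the rename
```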

skythought/skythought_evals/tasks/apps/apps_handler.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -101,7 +101,7 @@ def make_conversations(self, data, system_prompt, model=None):
     def load_and_filter_dataset(
         self, start, end, split=None, subset=None, difficulty=None, args=None
     ):
-        train_data = self.load_dataset(subset=subset, split=split).to_pandas()
+        train_data = self.load_dataset(subset=subset, split=split)
         if difficulty or "difficulty" in self.task_config.preprocess_config:
             difficulty = (
                 self.task_config.preprocess_config["difficulty"]
@@ -110,6 +110,8 @@ def load_and_filter_dataset(
             )
             train_data = train_data.filter(lambda x: x["difficulty"] == difficulty)
 
+        train_data = train_data.to_pandas()
+
         return train_data.iloc[start:end] if end > 0 else train_data.iloc[start:]
 
     def process_remaining_data(self, train_data, results):
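The fix above reorders the conversion: row filtering happens while the data is still a Hugging Face `Dataset` (whose `filter` takes a per-row predicate), and `.to_pandas()` runs only afterwards, since `pandas.DataFrame.filter` selects labels rather than rows. A toy sketch of that ordering, using a hypothetical stand-in class instead of the real `datasets.Dataset`:

```python
# Toy stand-in (NOT the real Hugging Face class) illustrating the
# filter-first, convert-last ordering from the fixed handler.

class ToyDataset:
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        # Like datasets.Dataset.filter: keeps rows where predicate is True.
        return ToyDataset([r for r in self.rows if predicate(r)])

    def to_records(self):
        # Stand-in for .to_pandas(); returns plain row dicts.
        return list(self.rows)

data = ToyDataset([
    {"problem": "p1", "difficulty": "introductory"},
    {"problem": "p2", "difficulty": "competition"},
    {"problem": "p3", "difficulty": "competition"},
])

# Mirrors the fixed flow: filter rows first, then convert.
filtered = data.filter(lambda x: x["difficulty"] == "competition").to_records()
print(len(filtered))  # 2
```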
