Introduce new scoring APIs for curation + training by SumanthRH · Pull Request #88 · NovaSky-AI/SkyThought

SumanthRH · 2025-02-27T06:02:43Z

What does this PR do?

WIP PR to introduce the new scoring APIs to be shared between evaluation + curation + training.

Also adds an example for using this with the new ray.data.llm APIs: docs.ray.io/en/master/data/working-with-llms.html using Sky-T1-32B-Preview data curation.

TODO:

Complete recipe for Sky-T1-32B-Preview data curation
Add tests for TACO and APPS (basic tests for helper functions)
Add tests for new the Scorer interface
Add basic documentation for scoring APIs

Signed-off-by: SumanthRH <[email protected]>

SumanthRH · 2025-02-28T22:47:39Z

recipes/sky-t1-preview/recipe.py

+# We explicitly set the target number of blocks to help tune performance.
+# For materialized datasets, the number of blocks determined by ray data can be small,
+# especially for a multi-stage pipeline like the one here.
+TARGET_NUM_ROWS_PER_BLOCK = 100


I'm still tuning settings like these for performance

SumanthRH · 2025-02-28T22:48:22Z

skythought/evals/tasks/numina/numina_handler.py

    @timeout(5)  # Add timeout of 5 seconds
    def check_correctness(self, problem, generation):
        solution = extract_answer(problem[self.task_config.answer_key])
-        solution = strip_answer_string(solution)


not needed because strip_answer_string is already called in extract_answer

SumanthRH · 2025-02-28T22:48:44Z

skythought/evals/tasks/taco/taco_handler.py

        return dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]
+
+
+def _temp_run(problem, generation, debug, result):


Placed outside to solve the same issue as in #89

Signed-off-by: SumanthRH <[email protected]>

SumanthRH · 2025-02-28T23:30:15Z

recipes/sky-t1-preview/recipe.py

+    numina_ds_olympiads = numina_ds_olympiads.limit(num_samples)
+    numina_ds_math = numina_ds_math.limit(num_samples)
+
+# 2. Get model responses for each of the datasets


to remove comment

SumanthRH · 2025-03-01T00:37:15Z

recipes/sky-t1-preview/postprocess.py

+
+def convert_to_sharegpt_format(row: Dict[str, Any], prompt_column, response_column):
+    prompt = row[prompt_column]
+    # accept


Signed-off-by: SumanthRH <[email protected]>

SumanthRH · 2025-03-11T23:51:58Z

tests/evals/scoring/apps/test_apps.py

Will move the actual example to a config file once other tests also make it in. Currently there's only one test so having all the context in one place is good

erictang000 · 2025-03-12T00:30:40Z

skythought/evals/scoring/livecodebench/livecodebench_scorer.py

+        backend: The backend to use for scoring. Supports "ray" or "mp" (str).
+    """
+
+    TIMEOUT = 6


just wondering how did you pick this value (6 here and 10 for apps)?

the timeout value is actually from the original source code for the datasets.

erictang000

looks good to me!

erictang000 · 2025-03-12T00:33:00Z

skythought/evals/scoring/livecodebench/livecodebench_scorer.py

+        backend: The backend to use for scoring. Supports "ray" or "mp" (str).
+    """
+
+    TIMEOUT = 6


also should this be a class constant or could it make sense for this to be a user configurable parameter passed into the constructor?

erictang000 · 2025-03-12T00:38:57Z

recipes/sky-t1-preview/recipe.py

@@ -0,0 +1,272 @@
+"""
+This is the recipe for data curation for the Sky T1 Preview model . 


Suggested change

This is the recipe for data curation for the Sky T1 Preview model .

This is the recipe for data curation for the Sky T1 Preview model.

erictang000 · 2025-03-12T00:40:31Z

recipes/sky-t1-preview/recipe.py

+
+    config = vLLMEngineProcessorConfig(
+        model="Qwen/QwQ-32B-Preview",
+        # model="Qwen/Qwen2-0.5B-Instruct",


remove this?

Yes will do. recipe.py is still under construction.

Signed-off-by: SumanthRH <[email protected]>

PR to introduce the new scoring APIs to be shared between evaluation + curation + training. Also adds an example for using this with the new ray.data.llm APIs: docs.ray.io/en/master/data/working-with-llms.html using Sky-T1-32B-Preview data curation. The recipe is still WIP and has some rough edges on the original training dataset size, which will be fixed in a follow-up PR.

SumanthRH added 12 commits February 24, 2025 13:14

x

a0df2e6

Signed-off-by: SumanthRH <[email protected]>

initial commit

cbf83dc

Signed-off-by: SumanthRH <[email protected]>

get it done

2620713

Signed-off-by: SumanthRH <[email protected]>

let's add some docstrings

90bd4f5

Signed-off-by: SumanthRH <[email protected]>

x

e113874

Signed-off-by: SumanthRH <[email protected]>

more

6fdca5c

Signed-off-by: SumanthRH <[email protected]>

x

27bf724

Signed-off-by: SumanthRH <[email protected]>

improve and add tests

f55909f

Signed-off-by: SumanthRH <[email protected]>

switch to actor

5dd5978

Signed-off-by: SumanthRH <[email protected]>

add taco test

871de06

Signed-off-by: SumanthRH <[email protected]>

add more tests and fix resource requests

6a8bb2f

Signed-off-by: SumanthRH <[email protected]>

improve recipe

9e3203d

Signed-off-by: SumanthRH <[email protected]>

SumanthRH requested a review from erictang000 February 28, 2025 22:38

SumanthRH commented Feb 28, 2025

View reviewed changes

update tests and move some files

6c35b22

Signed-off-by: SumanthRH <[email protected]>

SumanthRH commented Feb 28, 2025

View reviewed changes

SumanthRH commented Mar 1, 2025

View reviewed changes

SumanthRH added 2 commits February 28, 2025 16:38

update recipe

8b6509c

Signed-off-by: SumanthRH <[email protected]>

x

9ad1238

Signed-off-by: SumanthRH <[email protected]>

SumanthRH commented Mar 11, 2025

View reviewed changes

erictang000 reviewed Mar 12, 2025

View reviewed changes

erictang000 approved these changes Mar 12, 2025

View reviewed changes

SumanthRH added 2 commits March 20, 2025 18:39

x

4282cb7

Signed-off-by: SumanthRH <[email protected]>

Merge remote-tracking branch 'origin' into sumanthrh/curation-apis

bc55a6c

Signed-off-by: SumanthRH <[email protected]>

SumanthRH marked this pull request as ready for review March 20, 2025 18:43

SumanthRH merged commit 75712f6 into main Mar 20, 2025
2 checks passed

		return dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]


		def _temp_run(problem, generation, debug, result):

		@@ -0,0 +1,272 @@
		"""
		This is the recipe for data curation for the Sky T1 Preview model .

Conversation

SumanthRH commented Feb 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SumanthRH Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

erictang000 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SumanthRH commented Feb 27, 2025 •

edited

Loading

SumanthRH Feb 28, 2025 •

edited

Loading