Skip to content

Conversation

@SumanthRH
Copy link
Member

@SumanthRH SumanthRH commented Feb 27, 2025

What does this PR do?

WIP PR to introduce the new scoring APIs to be shared between evaluation + curation + training.

Also adds an example for using this with the new ray.data.llm APIs: docs.ray.io/en/master/data/working-with-llms.html using Sky-T1-32B-Preview data curation.

TODO:

  • Complete recipe for Sky-T1-32B-Preview data curation
  • Add tests for TACO and APPS (basic tests for helper functions)
  • Add tests for new the Scorer interface
  • Add basic documentation for scoring APIs

x
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Signed-off-by: SumanthRH <[email protected]>
Comment on lines 38 to 41
# We explicitly set the target number of blocks to help tune performance.
# For materialized datasets, the number of blocks determined by ray data can be small,
# especially for a multi-stage pipeline like the one here.
TARGET_NUM_ROWS_PER_BLOCK = 100
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still tuning settings like these for performance

@timeout(5) # Add timeout of 5 seconds
def check_correctness(self, problem, generation):
solution = extract_answer(problem[self.task_config.answer_key])
solution = strip_answer_string(solution)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not needed because strip_answer_string is already called in extract_answer

return dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]


def _temp_run(problem, generation, debug, result):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placed outside to solve the same issue as in #89

numina_ds_olympiads = numina_ds_olympiads.limit(num_samples)
numina_ds_math = numina_ds_math.limit(num_samples)

# 2. Get model responses for each of the datasets
Copy link
Member Author

@SumanthRH SumanthRH Feb 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove comment


def convert_to_sharegpt_format(row: Dict[str, Any], prompt_column, response_column):
prompt = row[prompt_column]
# accept
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove

Signed-off-by: SumanthRH <[email protected]>
x
Signed-off-by: SumanthRH <[email protected]>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will move the actual example to a config file once other tests also make it in. Currently there's only one test so having all the context in one place is good

backend: The backend to use for scoring. Supports "ray" or "mp" (str).
"""

TIMEOUT = 6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wondering how did you pick this value (6 here and 10 for apps)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the timeout value is actually from the original source code for the datasets.

Copy link
Collaborator

@erictang000 erictang000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me!

backend: The backend to use for scoring. Supports "ray" or "mp" (str).
"""

TIMEOUT = 6
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also should this be a class constant or could it make sense for this to be a user configurable parameter passed into the constructor?

@@ -0,0 +1,272 @@
"""
This is the recipe for data curation for the Sky T1 Preview model .
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This is the recipe for data curation for the Sky T1 Preview model .
This is the recipe for data curation for the Sky T1 Preview model.


config = vLLMEngineProcessorConfig(
model="Qwen/QwQ-32B-Preview",
# model="Qwen/Qwen2-0.5B-Instruct",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes will do. recipe.py is still under construction.

@SumanthRH SumanthRH marked this pull request as ready for review March 20, 2025 18:43
@SumanthRH SumanthRH merged commit 75712f6 into main Mar 20, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants