Introduce new scoring APIs for curation + training #88
Conversation
Signed-off-by: SumanthRH <sumanthrh@anyscale.com>
recipes/sky-t1-preview/recipe.py
Outdated
# We explicitly set the target number of blocks to help tune performance.
# For materialized datasets, the number of blocks determined by ray data can be small,
# especially for a multi-stage pipeline like the one here.
TARGET_NUM_ROWS_PER_BLOCK = 100
I'm still tuning settings like these for performance
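For context, here is a hypothetical sketch of how a target rows-per-block setting could be turned into a block count; the helper name and ceiling-division approach are assumptions for illustration, not the recipe's actual code:

```python
# Assumed helper: derive a block count from a target rows-per-block,
# so a small materialized dataset still gets enough blocks for parallelism.
TARGET_NUM_ROWS_PER_BLOCK = 100


def target_num_blocks(num_rows: int,
                      rows_per_block: int = TARGET_NUM_ROWS_PER_BLOCK) -> int:
    """Ceiling division, with at least one block."""
    return max(1, -(-num_rows // rows_per_block))
```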
@timeout(5)  # Add timeout of 5 seconds
def check_correctness(self, problem, generation):
    solution = extract_answer(problem[self.task_config.answer_key])
    solution = strip_answer_string(solution)
Not needed: strip_answer_string is already called inside extract_answer.
    return dataset.iloc[start:end] if end > 0 else dataset.iloc[start:]


def _temp_run(problem, generation, debug, result):
Placed outside to solve the same issue as in #89
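For reference, the reason _temp_run needs to live at module level: multiprocessing pickles the target function by its qualified name, so a function nested inside a method fails to serialize. A minimal sketch of the pattern (the surrounding helper, its arguments, and the 10-second join are assumptions for illustration):

```python
import multiprocessing


# Module-level target: pickled by qualified name, so it must not be
# defined inside a class method or closure.
def _temp_run(problem, generation, debug, result):
    # `result` is a managed list so the child process can report back.
    result.append({"problem": problem, "generation": generation, "debug": debug})


def run_in_subprocess(problem, generation, debug=False):
    """Run _temp_run in a child process so a hang or crash can't take down the parent."""
    manager = multiprocessing.Manager()
    result = manager.list()
    proc = multiprocessing.Process(
        target=_temp_run, args=(problem, generation, debug, result)
    )
    proc.start()
    proc.join(timeout=10)
    if proc.is_alive():
        proc.kill()  # assumed policy: treat a hung check as a failure
    return list(result)
```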
recipes/sky-t1-preview/recipe.py
Outdated
numina_ds_olympiads = numina_ds_olympiads.limit(num_samples)
numina_ds_math = numina_ds_math.limit(num_samples)

# 2. Get model responses for each of the datasets

def convert_to_sharegpt_format(row: Dict[str, Any], prompt_column, response_column):
    prompt = row[prompt_column]
    # accept
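A completed version of this helper might look like the following sketch; it assumes ShareGPT's conventional "from"/"value" conversation schema, and the column names are placeholders rather than the recipe's actual ones:

```python
from typing import Any, Dict


def convert_to_sharegpt_format(
    row: Dict[str, Any], prompt_column: str, response_column: str
) -> Dict[str, Any]:
    """Map a flat (prompt, response) row into the ShareGPT conversation schema."""
    prompt = row[prompt_column]
    response = row[response_column]
    return {
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": response},
        ]
    }
```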
Will move the actual example to a config file once other tests also make it in. Currently there's only one test, so having all the context in one place is good.
    backend: The backend to use for scoring. Supports "ray" or "mp" (str).
"""

TIMEOUT = 6
Just wondering: how did you pick this value (6 here and 10 for apps)?
The timeout value is actually from the original source code for the datasets.
    backend: The backend to use for scoring. Supports "ray" or "mp" (str).
"""

TIMEOUT = 6
Also, should this be a class constant, or could it make sense for this to be a user-configurable parameter passed into the constructor?
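One shape the suggestion could take: keep a class-level default (matching the value from the datasets' original source code) but let callers override it per instance. The class and parameter names here are illustrative, not the PR's actual API:

```python
from typing import Optional


class CodeScorer:
    # Default taken from the original dataset's source code;
    # override per instance if a workload needs more headroom.
    DEFAULT_TIMEOUT = 6

    def __init__(self, timeout: Optional[int] = None):
        self.timeout = timeout if timeout is not None else self.DEFAULT_TIMEOUT
```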
@@ -0,0 +1,272 @@
"""
This is the recipe for data curation for the Sky T1 Preview model .

Suggested change:
- This is the recipe for data curation for the Sky T1 Preview model .
+ This is the recipe for data curation for the Sky T1 Preview model.
recipes/sky-t1-preview/recipe.py
Outdated
config = vLLMEngineProcessorConfig(
    model="Qwen/QwQ-32B-Preview",
    # model="Qwen/Qwen2-0.5B-Instruct",
Yes, will do. recipe.py is still under construction.
What does this PR do?
WIP PR to introduce the new scoring APIs to be shared between evaluation + curation + training.
Also adds an example of using this with the new ray.data.llm APIs (docs.ray.io/en/master/data/working-with-llms.html) for Sky-T1-32B-Preview data curation. The recipe is still WIP and has some rough edges on the original training dataset size, which will be fixed in a follow-up PR.
TODO: