Evaluation of user data using Unitxt

Here is the suggested flow. Let's discuss it in a meeting to see whether it makes sense and modify it as needed:

The evaluation command `ilab model evaluate new_data` will have the following parameters:

1 - `csv_path`: path to the user data

- A CSV file with two required columns ('instruction', 'input') and two optional columns ('answer', 'context')
- 'context' is used for the RAG task only
- 'answer' is the golden (ground-truth) answer, if available
- 'instruction' explains the task ("Summarize this text", "Complete this sentence", "Classify this input into one of the following: ...")
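
The column rules above can be checked before any evaluation runs. This is a minimal sketch using Python's standard `csv` module, assuming the file has a header row with exactly these column names. The helper name `validate_user_csv` is hypothetical (not part of Unitxt or `ilab`), and requiring 'context' for RAG and non-empty 'instruction'/'input' values are assumptions, since the proposal only lists the columns.

```python
import csv
import io

REQUIRED_COLUMNS = {"instruction", "input"}
OPTIONAL_COLUMNS = {"answer", "context"}  # 'answer' = golden answer, 'context' = RAG only


def validate_user_csv(text: str, task_type: str) -> list[dict[str, str]]:
    """Parse CSV text and check it against the proposed schema (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    columns = set(reader.fieldnames or [])

    missing = REQUIRED_COLUMNS - columns
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")

    # Assumption: a RAG evaluation cannot run without the retrieved context.
    if task_type == "RAG" and "context" not in columns:
        raise ValueError("the RAG task requires a 'context' column")

    rows = list(reader)
    # Assumption: required columns must also have a non-empty value in every row.
    for row_number, row in enumerate(rows, start=1):
        for column in REQUIRED_COLUMNS:
            if not (row.get(column) or "").strip():
                raise ValueError(f"data row {row_number}: empty '{column}' value")
    return rows
```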

	
2 - `task_type`, one of the following options:

- Classification
- Question Answering [Let's discuss whether to offer multiple-choice QA and simple QA as two separate options]
- Summarization
- Generation
- RAG
- Other [Let's discuss whether this can be removed, since behind the scenes it gets the same treatment as QA]

3 - `use_llmaaj` (use LLM-as-a-judge; False by default)

- False: use the standard default metric for the task
    - for some tasks (QA, Generation, Other), the default is already LLM-as-a-judge
    - LLM-as-a-judge is also used when golden answers are not available
- True: use a judge (with templates predefined for the task type)
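
The metric-selection rule in item 3 can be written as a small decision function. This sketch assumes the task names above as enum values; `TaskType`, `choose_metric`, and the returned labels `"llm_as_judge"` and `"standard"` are hypothetical names for illustration, not Unitxt metric identifiers.

```python
from enum import Enum


class TaskType(Enum):
    CLASSIFICATION = "Classification"
    QUESTION_ANSWERING = "Question Answering"
    SUMMARIZATION = "Summarization"
    GENERATION = "Generation"
    RAG = "RAG"
    OTHER = "Other"


# Tasks whose default metric is already LLM-as-a-judge, per the proposal.
JUDGE_BY_DEFAULT = {TaskType.QUESTION_ANSWERING, TaskType.GENERATION, TaskType.OTHER}


def choose_metric(task: TaskType, use_llmaaj: bool, has_answers: bool) -> str:
    """Return which kind of metric to use (hypothetical labels, not Unitxt metric IDs)."""
    if use_llmaaj:
        # Explicit request: judge with templates predefined for the task type.
        return "llm_as_judge"
    if task in JUDGE_BY_DEFAULT or not has_answers:
        # Judge is the task default, or there are no golden answers to compare against.
        return "llm_as_judge"
    # Otherwise fall back to the task's standard reference-based metric.
    return "standard"
```

Under this reading, a standard metric is used only when the user has not asked for a judge, the task has a non-judge default, and golden answers are present; every other combination falls through to the judge.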

4 - `num_shots` (0 by default)
- Do we want to let the user select the number of shots?
- Or do we want to drop this option, run a few configurations (0, 2, and 5 shots), and tell the user which setting works best?
 

Given the command, Unitxt will run the provided data with the chosen task and replace the metric if LLM-as-a-judge is selected.
The data will be run in multiple configurations (formatted with different templates that match the task).
The results will include a recommendation for the best-performing template among those used.
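
The template sweep and recommendation described above amount to a small grid search. This is a minimal sketch, assuming a caller-supplied `score` function that runs one configuration and returns a single aggregate score where higher is better. The helper `recommend_configuration`, the template names, and the optional shot-count axis (borrowed from the 0/2/5-shot idea under `num_shots`) are illustrative assumptions, not Unitxt API.

```python
from itertools import product
from typing import Callable, Iterable


def recommend_configuration(
    templates: Iterable[str],
    shot_counts: Iterable[int],
    score: Callable[[str, int], float],
) -> tuple[str, int, float]:
    """Score every (template, num_shots) pair and return the best (hypothetical helper)."""
    # Run each configuration once and keep its aggregate score.
    results = {
        (template, shots): score(template, shots)
        for template, shots in product(templates, shot_counts)
    }
    if not results:
        raise ValueError("no configurations to evaluate")
    # Higher is assumed better; max() keeps the first configuration tried on ties.
    (best_template, best_shots), best_score = max(results.items(), key=lambda item: item[1])
    return best_template, best_shots, best_score
```

Passing `shot_counts=[0]` reduces this to the template-only recommendation; adding `[0, 2, 5]` would cover the option of reporting the best shot setting instead of asking the user for one, at the cost of three times as many runs.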

Evaluation of user data using Unitxt #176