Add prediction variable name customization to LLM as Judge #1622
Conversation
Example usage:

```python
from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import CreateCriteriaFromString
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "The temperature is described in both Fahrenheit and Celsius.",
            "response_variable_name": "assistant response",
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
            "response_variable_name": "joke",
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str, "response_variable_name": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria,response_variable_name_field=response_variable_name,include_prompts_in_result=True]"
        ],
        default_template=NullTemplate(),
    ),
)

test_dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=test_dataset)

print("Global Scores:")
print(results.global_scores.summary)
print("Instance Scores:")
print(results.instance_scores.summary)
```
Assessment prompt for response_variable_name = "assistant response":

```
[{'role': 'user', 'content': 'You are provided a pair of assistant responses (Assistant response 1 and Assistant response 2) generated subject to a context.\nYou will choose the better quality assistant response subject to the evaluation criteria.\n\nThis is the context:\nquestion: How is the weather?\n\nThis is the evaluation criteria:\n\nThe temperature is described in both Fahrenheit and Celsius.\n\nAssistant response 1:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\nAssistant response 2:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\n\nKeeping the evaluation criteria in mind, briefly assess which assistant response is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```

Assessment prompt for response_variable_name = "joke":

```
[{'role': 'user', 'content': 'You are provided a pair of jokes (Joke 1 and Joke 2) generated subject to a context.\nYou will choose the better quality joke subject to the evaluation criteria.\n\nThis is the context:\nquestion: Tell me a joke about cats\n\nThis is the evaluation criteria:\n\nIs the response funny?\n\nJoke 1:\nWhy did the cat cross the road? To cat to the other side.\nJoke 2:\nWhy did the cat sit on the computer? Because it wanted to keep an eye on the mouse!\n\nKeeping the evaluation criteria in mind, briefly assess which joke is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]
```
Note that the evaluator's prompts use the plural form of the response variable name. I am currently just appending an 's' to it, which breaks for words with irregular plurals, e.g. 'summary' should become 'summaries', not 'summarys'.
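For illustration, a pluralization library could cover the irregular cases. Below is a minimal sketch assuming the third-party `inflect` package is an acceptable dependency; the `pluralize` helper is hypothetical and not part of this PR.

```python
# Sketch only: assumes the third-party `inflect` package as a dependency.
# The `pluralize` helper is hypothetical and not part of this PR.
import inflect

_engine = inflect.engine()

def pluralize(response_variable_name: str) -> str:
    """Return the plural form of the response variable name for use in prompts."""
    # inflect handles irregular plurals, e.g. "summary" -> "summaries".
    return _engine.plural(response_variable_name)

print(pluralize("response"))  # responses
print(pluralize("summary"))   # summaries
print(pluralize("joke"))      # jokes
```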
I think the main question is whether this makes a measurable difference in judge performance, because it complicates the API and the interface.
This PR implements customization of the prediction variable name, allowing the user to give the evaluator more information about what exactly is being evaluated.
Currently, the prompts refer to a generic 'response'. This PR allows the prompt to refer to other terms such as 'text', 'prediction', 'assistant_response', 'summary', etc.
The customization can be done in two ways (sketched below):
- setting LLMJudge.response_variable_name, which defaults to 'response' so that existing configurations remain backward compatible, or
- setting LLMJudge.response_variable_name_field and including a key with that field's name in every task_data instance.
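For illustration, a minimal sketch of the two options, reusing the bracketed metric-override syntax from the usage example above. The `response_variable_name=summary` override in option 1 is an assumption based on the attribute name described here; the exact parameter spelling should be checked against the PR.

```python
# Option 1 (assumed spelling): one static name shared by all instances via
# LLMJudge.response_variable_name; it defaults to "response", so existing
# configurations keep working unchanged.
static_name_metric = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name=summary]"
)

# Option 2 (taken from the usage example above): a per-instance name, read
# from the task_data key named by response_variable_name_field.
per_instance_metric = (
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,"
    "response_variable_name_field=response_variable_name]"
)
```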