Add prediction variable name customization to LLM as Judge #1622

Draft · martinscooper wants to merge 1 commit into main from llm-judge-response-name
Conversation

martinscooper (Collaborator)

This PR adds customization of the prediction variable name, allowing the user to give the evaluator more information about what exactly is being evaluated.

Currently, the prompts refer to a generic 'response'. This PR allows the prompts to refer to other terms instead, such as 'text', 'prediction', 'assistant_response', or 'summary'.

The customization can be done in two ways:

  • setting LLMJudge.response_variable_name, which defaults to 'response' so that existing code remains backward compatible (see the sketch after this list), or
  • setting LLMJudge.response_variable_name_field and including, in every task_data instance, a key whose name is the value of response_variable_name_field (this is the approach used in the full example below).
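
A minimal sketch of the first option, assuming response_variable_name can be passed through the metric artifact string the same way response_variable_name_field is in the full example below ('summary' is an illustrative value):

# Sketch (assumption): set the response variable name directly on the judge
# via the metric artifact string, so the prompts talk about a 'summary'
# instead of a generic 'response'. Context/criteria field names are illustrative.
metrics = [
    "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b"
    "[context_fields=question,criteria_field=criteria,response_variable_name=summary]"
]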

@martinscooper force-pushed the llm-judge-response-name branch from 7917e11 to 8b0d2c9 on February 21, 2025 16:58
Signed-off-by: Martín Santillán Cooper <[email protected]>
@martinscooper force-pushed the llm-judge-response-name branch from 090ff22 to eca7455 on February 21, 2025 17:31
@martinscooper (Collaborator Author)

Example usage:

from typing import Any, List

from unitxt import evaluate, load_dataset
from unitxt.blocks import Task, TaskCard
from unitxt.llm_as_judge import CreateCriteriaFromString
from unitxt.loaders import LoadFromDictionary
from unitxt.templates import NullTemplate

data = {
    "test": [
        {
            "question": "How is the weather?",
            "judgement": "The temperature is described in both Fahrenheit and Celsius.",
            "response_variable_name": "assistant response"
            
        },
        {
            "question": "Tell me a joke about cats",
            "judgement": "Is the response funny?",
            "response_variable_name": "joke"
        },
    ]
}

card = TaskCard(
    loader=LoadFromDictionary(data=data, data_classification_policy=["public"]),
    preprocess_steps=[
        CreateCriteriaFromString(field="judgement", to_field="criteria"),
    ],
    task=Task(
        input_fields={"question": str, "response_variable_name": str},
        reference_fields={"criteria": Any},
        prediction_type=List[str],
        metrics=[
            "metrics.llm_as_judge.pairwise.watsonx.llama3_1_70b[context_fields=question,criteria_field=criteria,response_variable_name_field=response_variable_name,include_prompts_in_result=True]"
        ],
        default_template=NullTemplate(),
    ),
)

test_dataset = load_dataset(card=card, split="test")

predictions = [
    [
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
        """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""",
    ],
    [
        """Why did the cat cross the road? To cat to the other side.""",
        """Why did the cat sit on the computer? Because it wanted to keep an eye on the mouse!""",
        """What is red, yellow and green? A traffic light.""",
    ],
]

results = evaluate(predictions=predictions, data=test_dataset)

print("Global Scores:")
print(results.global_scores.summary)

print("Instance Scores:")
print(results.instance_scores.summary)

@martinscooper requested review from OfirArviv, elronbandel, and yoavkatz, and removed the request for OfirArviv on February 21, 2025 17:31
@martinscooper (Collaborator Author)

Assessment prompt for "response_variable_name": "assistant response":

[{'role': 'user', 'content': 'You are provided a pair of assistant responses (Assistant response 1 and Assistant response 2) generated subject to a context.\nYou will choose the better quality assistant response subject to the evaluation criteria.\n\nThis is the context:\nquestion: How is the weather?\n\nThis is the evaluation criteria:\n\nThe temperature is described in both Fahrenheit and Celsius.\n\nAssistant response 1:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\nAssistant response 2:\nOn most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.\n\nKeeping the evaluation criteria in mind, briefly assess which assistant response is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]

Assessment prompt for "response_variable_name": "joke":

[{'role': 'user', 'content': 'You are provided a pair of jokes (Joke 1 and Joke 2) generated subject to a context.\nYou will choose the better quality joke subject to the evaluation criteria.\n\nThis is the context:\nquestion: Tell me a joke about cats\n\nThis is the evaluation criteria:\n\nIs the response funny?\n\nJoke 1:\nWhy did the cat cross the road? To cat to the other side.\nJoke 2:\nWhy did the cat sit on the computer? Because it wanted to keep an eye on the mouse!\n\nKeeping the evaluation criteria in mind, briefly assess which joke is better.\nFocus on the evaluation criteria during assessment, do not provide a general assessment.\nAssessment:\n\nLets think step by step '}]

@martinscooper (Collaborator Author)

The evaluator's prompts also use the plural form of the response variable name. I am currently just appending an 's' to it, but this fails for words with irregular plurals, e.g. 'summary' should become 'summaries' rather than 'summarys'.
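
One possible direction (a sketch, not part of this PR) is to special-case the common consonant + 'y' ending and otherwise fall back to appending 's'; a library such as inflect could also handle this more robustly:

VOWELS = "aeiou"

def pluralize(name: str) -> str:
    # Sketch of a pluralization helper (not part of this PR): handle the common
    # consonant + 'y' -> 'ies' case ('summary' -> 'summaries'), otherwise append 's'.
    if len(name) > 1 and name.endswith("y") and name[-2] not in VOWELS:
        return name[:-1] + "ies"
    return name + "s"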

@yoavkatz (Member)

I think the main point is whether it makes a difference in performance (because it complicates the API and the interface).

@martinscooper marked this pull request as draft on March 12, 2025 12:53