feat: Minor updates to multi modal prompts as per inputs from quality team #5147

Open
wants to merge 1 commit into
base: main
2 changes: 2 additions & 0 deletions tests/unit/vertexai/test_rubric_based_eval.py
@@ -65,6 +65,7 @@
),
],
"response": ["test", "text"],
"description": ["description", "description"],
}
)
_TEST_PAIRWISE_MULTIMODAL_EVAL_DATASET = pd.DataFrame(
@@ -312,6 +313,7 @@ def test_pointwise_multimodal_understanding_metric(self):
"prompt",
"image",
"response",
"description",
"rubrics",
"rb_multimodal_understanding/score",
"rb_multimodal_understanding/rubric_verdict_pairs",
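The test change above adds a `description` column to the pointwise multimodal dataset and expects it to appear among the result columns. A minimal sketch of that expectation, using a plain dict as a stand-in for the `pd.DataFrame` in the test; the placeholder values and the helper name `missing_columns` are assumptions for illustration and are not part of the test file:

```python
# Stand-in for the pointwise multimodal test dataset (placeholder values).
dataset = {
    "prompt": ["prompt one", "prompt two"],
    "image": ["<image part 1>", "<image part 2>"],
    "response": ["test", "text"],
    "description": ["description", "description"],
}

# Input columns the updated test expects to see carried into the results.
EXPECTED_INPUT_COLUMNS = ["prompt", "image", "response", "description"]


def missing_columns(table, expected):
    """Return the expected column names that are absent from the table."""
    return [name for name in expected if name not in table]


# Every column should supply one value per row.
row_counts = {name: len(values) for name, values in dataset.items()}

print(missing_columns(dataset, EXPECTED_INPUT_COLUMNS))
print(row_counts)
```

If `description` were dropped from the dataset, `missing_columns` would report it, which mirrors the failure the updated column assertion is meant to catch.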
65 changes: 38 additions & 27 deletions vertexai/preview/evaluation/metrics/_default_templates.py
@@ -1043,6 +1043,7 @@

REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model! You must strictly adhere to the output format in this prompt.
Evaluation:

<question>
"""
PAIRWISE_INSTRUCTION_FOLLOWING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions
@@ -1113,13 +1114,18 @@
"""

MULTIMODAL_UNDERSTANDING_RUBRIC_GENERATION_PROMPT_TEMPLATE = """# Instructions
Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, to generate rubrics for an image and user prompt that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.
Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, you will generate a checklist of questions for a user prompt and its associated image that will measure how complete and correct a model response is based on how many questions it satisfies. The image and user prompt are provided below.


# Rubric Guidelines

First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a list of yes/no questions. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.
Generate the rubric following the provided guidelines below.

First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a numbered list. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.

For example, if the prompt is:

**Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format**
"Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format"

Then you will return:

@@ -1128,27 +1134,32 @@
[[Description of image]]
The image shows a skillet filled with cooked food. There are four chicken breasts arranged inside the skillet sitting in a creamy pale sauce. Between the chicken lie pieces of green asparagus. Slices of lemon are also interspersed within the skillet. The skillet itself appears to be a dark color, possibly cast iron, and has a teal handle. The background has a mottled grey texture.
[[Questions]]
1. Does the response correctly determine all of the ingredients in the image?
2. Does the response correctly state the type of cuisine seen in the image?
3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
4. Does the response correctly display the above three properties as a properly formatted JSON list?
1. Determine all of the ingredients seen in the image
2. Determine the type of cuisine seen in the image
3. Determine if the displayed cuisine is vegetarian or not
4. Display the above three properties as a JSON list

---

When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the iter field by 1 each time. Do not exceed 3 iterations in total.
# Recursive Self-Refinement

When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the "iter" field by 1 each time. Do not exceed 5 iterations in total.
Repeat the above process until you are satisfied your description of the image is accurate and your rubric fully covers and accurately represents the original prompt instructions, or if there is no significant difference between your last answer and its preceding answer. Append each revised answer below the previous answer at each iteration.
When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer at the very bottom. Example:
When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer after all your iterations. Example:

---
[[FINAL ANSWER]]
[[Description of image]]
The image is a close-up, overhead shot of a black skillet with a teal handle filled with cooked food. The main elements are four, cooked chicken breasts which are browned and drizzled with a creamy white sauce. These breasts are resting in a light yellow, creamy sauce. Green asparagus spears and slices of lemon are interspersed and arranged around the chicken. The entire skillet is sitting on a textured grey surface.
[[Questions]]
1. Does the response correctly determine all of the ingredients in the image?
2. Does the response correctly state the type of cuisine seen in the image?
3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
4. Does the response correctly display the above three properties as a properly formatted JSON list?
1. Determine all of the ingredients of the entree seen in the image
2. Determine the type of cuisine seen in the image
3. Determine if the displayed cuisine is vegetarian or not
4. Display the above three properties as a JSON list
---

# JSON return format

Finally, translate the description and questions of your final answer into JSON format according to this schema:

```json
@@ -1204,9 +1215,7 @@
# User Inputs, AI-generated Response, and Rubrics
## User Inputs
### Image
<MM_IMAGE>
{image}
</MM_IMAGE>
<MM_IMAGE>{image}</MM_IMAGE>

### Prompt
{prompt}
@@ -1220,46 +1229,48 @@
REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model!

Evaluation:

<question>
"""
PAIRWISE_MULTIMODAL_UNDERSTANDING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions:

Your task is to evaluate the image understanding quality of responses generated by two AI models. At the bottom of this system instruction you will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric as a numbered list against which Response A and Response B will be judged. Each rubricv question is a list of instructions that each response must follow in order to satisfy the user prompt.
Your task is to evaluate the image understanding quality of responses generated by two AI models. You will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric against which Response A and Response B will be judged. The rubric is a list of questions that each response must follow in order to satisfy the user prompt.

# Rubric Scoring:

For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Finally, score the response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand and state your reasoning.
For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Then, write a paragraph analyzing how enjoyable this response would be to read for a human, taking into account factors like grammar, tone, relevance, and ease of comprehension. Finally, score this response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand.

For example, if the rubric questions are:
[[Rubric]]
<question>Does the response correctly determine all of the ingredients in the image?
<question>Does the response correctly state the type of cuisine seen in the image?
<question>Does the response correctly judge whether or not the displayed cuisine is vegetarian?
<question>Does the response correctly display the above three properties as a properly formatted JSON list?
<question>Determine all of the ingredients seen in the image.
<question>Determine the type of cuisine seen in the image.
<question>Determine if the displayed cuisine is vegetarian or not
<question>Display the above three properties as a JSON list


Then you will score Response A as:

[[Response A Answers:]]
<question>
Question: Does the response correctly determine all of the ingredients in the image?
Question: Does the Response A correctly determine all of the ingredients in the image?
Verdict: [[YES]]
</question>
<question>
Question: Does the response correctly state the type of cuisine seen in the image?
Question: Does the Response A correctly state the type of cuisine seen in the image?
Verdict: [[NO]]
</question>
<question>
Question: Does the response correctly judge whether or not the displayed cuisine is vegetarian?
Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
Verdict: [[YES]]
</question>
<question>
Question: Does the response correctly display the above three properties as a properly formatted JSON list?
Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
Verdict: [[NO]]
</question>

[[Rubric Score: 2/4]]
[[Human Enjoyment Rating: 4 stars]]
[[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
[[Human Enjoyment Rating: 4 stars]]

Repeat the above for Response B.

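The scoring example in the pairwise critique template implies a simple post-processing step: collect each `Verdict` from the `<question>` blocks, count the `[[YES]]` answers, and read the star rating. The sketch below shows one way that could be done with regular expressions. It is not the SDK's parser; the function name `parse_critique`, the regex patterns, and the sample text are assumptions based only on the format shown in the template.

```python
import re

# Sample judge output following the format in the template's example.
SAMPLE_OUTPUT = """[[Response A Answers:]]
<question>
Question: Does the Response A correctly determine all of the ingredients in the image?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly state the type of cuisine seen in the image?
Verdict: [[NO]]
</question>
<question>
Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
Verdict: [[NO]]
</question>

[[Rubric Score: 2/4]]
[[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
[[Human Enjoyment Rating: 4 stars]]
"""

# One <question> block: the question text followed by a YES/NO verdict.
QUESTION_BLOCK = re.compile(
    r"<question>\s*Question:\s*(.*?)\s*Verdict:\s*\[\[(YES|NO)\]\]\s*</question>",
    re.DOTALL,
)
STARS = re.compile(r"\[\[Human Enjoyment Rating:\s*(\d)\s*stars?\]\]")


def parse_critique(text):
    """Extract (question, passed) pairs, the rubric score, and the star rating."""
    pairs = [(question, verdict == "YES") for question, verdict in QUESTION_BLOCK.findall(text)]
    passed = sum(1 for _, ok in pairs if ok)
    stars_match = STARS.search(text)
    stars = int(stars_match.group(1)) if stars_match else None
    return {"pairs": pairs, "score": (passed, len(pairs)), "stars": stars}


result = parse_critique(SAMPLE_OUTPUT)
print(result["score"], result["stars"])
```

Recomputing the score from the verdicts, rather than trusting the `[[Rubric Score: ...]]` line the judge model prints, guards against the model miscounting its own answers.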