Commit dae3eed

vertex-sdk-bot authored and copybara-github committed
feat: Minor updates to multi modal prompts as per inputs from quality team
PiperOrigin-RevId: 744079202
1 parent 9f21b73 commit dae3eed

2 files changed: +40 -27 lines changed

tests/unit/vertexai/test_rubric_based_eval.py

+2
@@ -65,6 +65,7 @@
 ),
 ],
 "response": ["test", "text"],
+"description": ["description", "description"],
 }
 )
 _TEST_PAIRWISE_MULTIMODAL_EVAL_DATASET = pd.DataFrame(
@@ -312,6 +313,7 @@ def test_pointwise_multimodal_understanding_metric(self):
 "prompt",
 "image",
 "response",
+"description",
 "rubrics",
 "rb_multimodal_understanding/score",
 "rb_multimodal_understanding/rubric_verdict_pairs",

vertexai/preview/evaluation/metrics/_default_templates.py

+38-27
@@ -1043,6 +1043,7 @@

 REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model! You must strictly adhere to the output format in this prompt.
 Evaluation:
+
 <question>
 """
 PAIRWISE_INSTRUCTION_FOLLOWING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions
@@ -1113,13 +1114,18 @@
 """

 MULTIMODAL_UNDERSTANDING_RUBRIC_GENERATION_PROMPT_TEMPLATE = """# Instructions
-Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, to generate rubrics for an image and user prompt that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.
+Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, you will generate a checklist of questions for a user prompt and its associated image that will measure how complete and correct a model response is based on how many questions it satisfies. The image and user prompt are provided below.
+
+
+# Rubric Guidelines

-First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a list of yes/no questions. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.
+Generate the rubric following the provided guidelines below.
+
+First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a numbered list. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.

 For example, if the prompt is:

-**Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format**
+"Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format"

 Then you will return:

@@ -1128,27 +1134,32 @@
 [[Description of image]]
 The image shows a skillet filled with cooked food. There are four chicken breasts arranged inside the skillet sitting in a creamy pale sauce. Between the chicken lie pieces of green asparagus. Slices of lemon are also interspersed within the skillet. The skillet itself appears to be a dark color, possibly cast iron, and has a teal handle. The background has a mottled grey texture.
 [[Questions]]
-1. Does the response correctly determine all of the ingredients in the image?
-2. Does the response correctly state the type of cuisine seen in the image?
-3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
-4. Does the response correctly display the above three properties as a properly formatted JSON list?
+1. Determine all of the ingredients seen in the image
+2. Determine the type of cuisine seen in the image
+3. Determine if the displayed cuisine is vegetarian or not
+4. Display the above three properties as a JSON list
+
 ---

-When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the iter field by 1 each time. Do not exceed 3 iterations in total.
+# Recursive Self-Refinement
+
+When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the "iter" field by 1 each time. Do not exceed 5 iterations in total.
 Repeat the above process until you are satisfied your description of the image is accurate and your rubric fully covers and accurately represents the original prompt instructions, or if there is no significant difference between your last answer and its preceding answer. Append each revised answer below the previous answer at each iteration.
-When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer at the very bottom. Example:
+When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer after all your iterations. Example:

 ---
 [[FINAL ANSWER]]
 [[Description of image]]
 The image is a close-up, overhead shot of a black skillet with a teal handle filled with cooked food. The main elements are four, cooked chicken breasts which are browned and drizzled with a creamy white sauce. These breasts are resting in a light yellow, creamy sauce. Green asparagus spears and slices of lemon are interspersed and arranged around the chicken. The entire skillet is sitting on a textured grey surface.
 [[Questions]]
-1. Does the response correctly determine all of the ingredients in the image?
-2. Does the response correctly state the type of cuisine seen in the image?
-3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
-4. Does the response correctly display the above three properties as a properly formatted JSON list?
+1. Determine all of the ingredients of the entree seen in the image
+2. Determine the type of cuisine seen in the image
+3. Determine if the displayed cuisine is vegetarian or not
+4. Display the above three properties as a JSON list
 ---

+# JSON return format
+
 Finally, translate the description and questions of your final answer into JSON format according to this schema:

 ```json
@@ -1204,9 +1215,7 @@
 # User Inputs, AI-generated Response, and Rubrics
 ## User Inputs
 ### Image
-<MM_IMAGE>
-{image}
-</MM_IMAGE>
+<MM_IMAGE>{image}</MM_IMAGE>

 ### Prompt
 {prompt}
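The rubric generation prompt changed in this commit asks the model to append revised answers and then repeat the last one under a `[[FINAL ANSWER]]` marker. As a rough illustration of how a caller might pull the final description and numbered rubric items out of such output, here is a minimal sketch. `parse_final_answer` and `FINAL_MARKER` are hypothetical names, not part of the Vertex AI SDK, and the sample text is abbreviated from the example in the prompt.

```python
import re
from typing import List, Tuple

FINAL_MARKER = "[[FINAL ANSWER]]"  # marker text taken from the prompt template


def parse_final_answer(text: str) -> Tuple[str, List[str]]:
    """Hypothetical helper: return (description, rubric items) from the last final answer."""
    # Earlier iterations precede the marker, so keep only what follows its last occurrence.
    _, marker, final = text.rpartition(FINAL_MARKER)
    if not marker:
        raise ValueError("no [[FINAL ANSWER]] marker found")
    description_part, sep, rest = final.partition("[[Questions]]")
    if not sep:
        raise ValueError("no [[Questions]] section in the final answer")
    description = description_part.replace("[[Description of image]]", "").strip()
    # Stop at the '---' separator so trailing content is not scanned for list items.
    questions_part = rest.partition("---")[0]
    questions = re.findall(r"^\s*\d+\.\s*(.+?)\s*$", questions_part, flags=re.MULTILINE)
    return description, questions


sample = """[[Description of image]]
An earlier draft description.
[[Questions]]
1. Determine all of the ingredients seen in the image
---
[[FINAL ANSWER]]
[[Description of image]]
A black skillet with four chicken breasts, asparagus, and lemon slices.
[[Questions]]
1. Determine all of the ingredients of the entree seen in the image
2. Determine the type of cuisine seen in the image
3. Determine if the displayed cuisine is vegetarian or not
4. Display the above three properties as a JSON list
---
"""

description, questions = parse_final_answer(sample)
print(description)
print(questions)
```

The sketch deliberately ignores the JSON translation step that the prompt requests afterwards; a real consumer would more likely read that JSON than re-parse the free-text sections.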
@@ -1220,46 +1229,48 @@
 REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model!

 Evaluation:
+
 <question>
 """
 PAIRWISE_MULTIMODAL_UNDERSTANDING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions:

-Your task is to evaluate the image understanding quality of responses generated by two AI models. At the bottom of this system instruction you will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric as a numbered list against which Response A and Response B will be judged. Each rubricv question is a list of instructions that each response must follow in order to satisfy the user prompt.
+Your task is to evaluate the image understanding quality of responses generated by two AI models. You will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric against which Response A and Response B will be judged. The rubric is a list of questions that each response must follow in order to satisfy the user prompt.

 # Rubric Scoring:

-For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Finally, score the response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand and state your reasoning.
+For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Then, write a paragraph analyzing how enjoyable this response would be to read for a human, taking into account factors like grammar, tone, relevance, and ease of comprehension. Finally, score this response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand.

 For example, if the rubric questions are:
 [[Rubric]]
-<question>Does the response correctly determine all of the ingredients in the image?
-<question>Does the response correctly state the type of cuisine seen in the image?
-<question>Does the response correctly judge whether or not the displayed cuisine is vegetarian?
-<question>Does the response correctly display the above three properties as a properly formatted JSON list?
+<question>Determine all of the ingredients seen in the image.
+<question>Determine the type of cuisine seen in the image.
+<question>Determine if the displayed cuisine is vegetarian or not
+<question>Display the above three properties as a JSON list
+

 Then you will score Response A as:

 [[Response A Answers:]]
 <question>
-Question: Does the response correctly determine all of the ingredients in the image?
+Question: Does the Response A correctly determine all of the ingredients in the image?
 Verdict: [[YES]]
 </question>
 <question>
-Question: Does the response correctly state the type of cuisine seen in the image?
+Question: Does the Response A correctly state the type of cuisine seen in the image?
 Verdict: [[NO]]
 </question>
 <question>
-Question: Does the response correctly judge whether or not the displayed cuisine is vegetarian?
+Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
 Verdict: [[YES]]
 </question>
 <question>
-Question: Does the response correctly display the above three properties as a properly formatted JSON list?
+Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
 Verdict: [[NO]]
 </question>

 [[Rubric Score: 2/4]]
-[[Human Enjoyment Rating: 4 stars]]
 [[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
+[[Human Enjoyment Rating: 4 stars]]

 Repeat the above for Response B.

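The Rubric Scoring section of the pairwise critique template defines the rubric grade as the number of `[[YES]]` verdicts over the total number of rubric points. A minimal sketch of that tally follows, assuming the judge output uses the `Verdict: [[YES]]` / `Verdict: [[NO]]` lines from the example. `VERDICT_RE` and `rubric_score` are hypothetical names for illustration, not SDK functions.

```python
import re
from typing import Tuple

# Matches the verdict lines used in the template's Response A example.
VERDICT_RE = re.compile(r"Verdict:\s*\[\[(YES|NO)\]\]")


def rubric_score(critique: str) -> Tuple[int, int]:
    """Hypothetical helper: return (YES verdicts, total verdicts) found in a critique."""
    verdicts = VERDICT_RE.findall(critique)
    passed = sum(1 for verdict in verdicts if verdict == "YES")
    return passed, len(verdicts)


sample = """[[Response A Answers:]]
<question>
Question: Does the Response A correctly determine all of the ingredients in the image?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly state the type of cuisine seen in the image?
Verdict: [[NO]]
</question>
<question>
Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
Verdict: [[NO]]
</question>
"""

passed, total = rubric_score(sample)
print(f"[[Rubric Score: {passed}/{total}]]")  # prints [[Rubric Score: 2/4]]
```

Recomputing the fraction from the individual verdicts, rather than trusting the `[[Rubric Score: ...]]` line the model writes, is one way to guard against arithmetic slips in the judge output.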
