feat: Minor updates to multimodal prompts per input from the quality team

vertex-sdk-bot · copybara-github · commit dae3eed58edf · 2025-04-05T23:37:45.000-07:00
PiperOrigin-RevId: 744079202
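
Reviewer note (illustrative, not part of this change): the pairwise critique template in the diff below asks the judge model to emit one `Verdict: [[YES]]` or `Verdict: [[NO]]` per rubric question, followed by a `[[Rubric Score: n/m]]` line and a `[[Human Enjoyment Rating: k stars]]` line. The following is a minimal sketch of how output in that shape could be tallied. `summarize_critique` and both regular expressions are hypothetical helpers written for this note; they are an assumption and not the SDK's actual parser.

```python
import re

# Hypothetical patterns (assumptions for this sketch), modeled on the worked
# example in PAIRWISE_MULTIMODAL_UNDERSTANDING_RUBRIC_CRITIQUE_TEMPLATE.
VERDICT_RE = re.compile(r"Verdict:\s*\[\[(YES|NO)\]\]")
RATING_RE = re.compile(r"\[\[Human Enjoyment Rating:\s*(\d)\s*stars?\]\]")


def summarize_critique(text: str) -> dict:
    """Count YES/NO verdicts and read the star rating from critique text."""
    verdicts = VERDICT_RE.findall(text)
    passed = sum(1 for verdict in verdicts if verdict == "YES")
    rating = RATING_RE.search(text)
    return {
        "passed": passed,
        "total": len(verdicts),
        "score": f"{passed}/{len(verdicts)}",
        # None when the rating line is missing from the text.
        "stars": int(rating.group(1)) if rating else None,
    }


# Abbreviated copy of the template's example answer for Response A.
sample = """[[Response A Answers:]]
<question>
Question: Does the Response A correctly determine all of the ingredients in the image?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly state the type of cuisine seen in the image?
Verdict: [[NO]]
</question>
<question>
Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
Verdict: [[YES]]
</question>
<question>
Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
Verdict: [[NO]]
</question>

[[Rubric Score: 2/4]]
[[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
[[Human Enjoyment Rating: 4 stars]]
"""

print(summarize_critique(sample))
# {'passed': 2, 'total': 4, 'score': '2/4', 'stars': 4}
```

The tally recomputed from the verdicts matches the `2/4` that the template's example reports, which is one way a caller could sanity-check a judge response before trusting its self-reported score.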
diff --git a/tests/unit/vertexai/test_rubric_based_eval.py b/tests/unit/vertexai/test_rubric_based_eval.py
@@ -65,6 +65,7 @@
             ),
         ],
         "response": ["test", "text"],
+        "description": ["description", "description"],
     }
 )
 _TEST_PAIRWISE_MULTIMODAL_EVAL_DATASET = pd.DataFrame(
@@ -312,6 +313,7 @@ def test_pointwise_multimodal_understanding_metric(self):
                 "prompt",
                 "image",
                 "response",
+                "description",
                 "rubrics",
                 "rb_multimodal_understanding/score",
                 "rb_multimodal_understanding/rubric_verdict_pairs",
diff --git a/vertexai/preview/evaluation/metrics/_default_templates.py b/vertexai/preview/evaluation/metrics/_default_templates.py
@@ -1043,6 +1043,7 @@
 
 REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model! You must strictly adhere to the output format in this prompt.
 Evaluation:
+
 <question>
 """
 PAIRWISE_INSTRUCTION_FOLLOWING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions
@@ -1113,13 +1114,18 @@
 """
 
 MULTIMODAL_UNDERSTANDING_RUBRIC_GENERATION_PROMPT_TEMPLATE = """# Instructions
-Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, to generate rubrics for an image and user prompt that describes the properties that should hold for a good response to that prompt. Generate the rubric following the provided guidelines.
+Your task is to generate a rubric that can be used to evaluate the image understanding quality of responses generated by an AI model. Specifically, you will generate a checklist of questions for a user prompt and its associated image that will measure how complete and correct a model response is based on how many questions it satisfies. The image and user prompt are provided below.
+
+
+# Rubric Guidelines
 
-First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a list of yes/no questions. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.
+Generate the rubric following the provided guidelines below.
+
+First, describe the contents of the image thoroughly, making sure to document all of the important objects and their interactions with each other and the scenery. Then, thoroughly examine the prompt and decompose its individual instructions into a numbered list. Be as specific and concise as possible for each question. Ensure each question directly relates to the image and infer the connection if it is not explicitly stated.
 
 For example, if the prompt is:
 
-**Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format**
+"Provide a list of all the following attributes: ingredients, type of cuisine, vegetarian or not, in JSON format"
 
 Then you will return:
 
@@ -1128,27 +1134,32 @@
 [[Description of image]]
 The image shows a skillet filled with cooked food. There are four chicken breasts arranged inside the skillet sitting in a creamy pale sauce. Between the chicken lie pieces of green asparagus. Slices of lemon are also interspersed within the skillet. The skillet itself appears to be a dark color, possibly cast iron, and has a teal handle. The background has a mottled grey texture.
 [[Questions]]
- 1. Does the response correctly determine all of the ingredients in the image?
- 2. Does the response correctly state the type of cuisine seen in the image?
- 3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
- 4. Does the response correctly display the above three properties as a properly formatted JSON list?
+1. Determine all of the ingredients seen in the image
+2. Determine the type of cuisine seen in the image
+3. Determine if the displayed cuisine is vegetarian or not
+4. Display the above three properties as a JSON list
+
 ---
 
-When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the iter field by 1 each time. Do not exceed 3 iterations in total.
+# Recursive Self-Refinement
+
+When you are finished, REVIEW the image, the prompt, and your rubric and determine if 1. your description of the image is correct and 2. every instruction contained in the prompt is fully covered by your rubric and is relevant to the image. If not, correct your mistakes and output a revised answer in addition to the previous answer. Increment the "iter" field by 1 each time. Do not exceed 5 iterations in total.
 Repeat the above process until you are satisfied your description of the image is accurate and your rubric fully covers and accurately represents the original prompt instructions, or if there is no significant difference between your last answer and its preceding answer. Append each revised answer below the previous answer at each iteration.
-When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer at the very bottom. Example:
+When you are confident your answer is correct or when max iterations are reached, output [[FINAL ANSWER]] and repeat your final answer after all your iterations. Example:
 
 ---
 [[FINAL ANSWER]]
 [[Description of image]]
 The image is a close-up, overhead shot of a black skillet with a teal handle filled with cooked food. The main elements are four, cooked chicken breasts which are browned and drizzled with a creamy white sauce. These breasts are resting in a light yellow, creamy sauce. Green asparagus spears and slices of lemon are interspersed and arranged around the chicken. The entire skillet is sitting on a textured grey surface.
 [[Questions]]
- 1. Does the response correctly determine all of the ingredients in the image?
- 2. Does the response correctly state the type of cuisine seen in the image?
- 3. Does the response correctly judge whether or not the displayed cuisine is vegetarian?
- 4. Does the response correctly display the above three properties as a properly formatted JSON list?
+1. Determine all of the ingredients of the entree seen in the image
+2. Determine the type of cuisine seen in the image
+3. Determine if the displayed cuisine is vegetarian or not
+4. Display the above three properties as a JSON list
 ---
 
+# JSON return format
+
 Finally, translate the description and questions of your final answer into JSON format according to this schema:
 
 ```json
@@ -1204,9 +1215,7 @@
 # User Inputs, AI-generated Response, and Rubrics
 ## User Inputs
 ### Image
-<MM_IMAGE>
-{image}
-</MM_IMAGE>
+<MM_IMAGE>{image}</MM_IMAGE>
 
 ### Prompt
 {prompt}
@@ -1220,46 +1229,48 @@
 REMEMBER: Your answer will help improve the AI model. It is important to answer the question correctly. Even answering "no" will improve the model!
 
 Evaluation:
+
 <question>
 """
 PAIRWISE_MULTIMODAL_UNDERSTANDING_RUBRIC_CRITIQUE_TEMPLATE = """# Instructions:
 
-Your task is to evaluate the image understanding quality of responses generated by two AI models. At the bottom of this system instruction you will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric as a numbered list against which Response A and Response B will be judged. Each rubricv question is a list of instructions that each response must follow in order to satisfy the user prompt.
+Your task is to evaluate the image understanding quality of responses generated by two AI models. You will be presented with an image, a text description of that image, a user prompt, and the responses of Model A and Model B to that user prompt. You will also be provided a rubric against which Response A and Response B will be judged. The rubric is a list of questions that each response must follow in order to satisfy the user prompt.
 
 # Rubric Scoring:
 
-For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Finally, score the response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand and state your reasoning.
+For each response, rephrase every rubric point as a question and answer [[YES]] or [[NO]] to each point. Then, display the rubric grade as the sum of the correct rubric points over the total number of points. Then, write a paragraph analyzing how enjoyable this response would be to read for a human, taking into account factors like grammar, tone, relevance, and ease of comprehension. Finally, score this response on a scale of 1 to 5 stars based on how enjoyable you think it is for a human to read and understand.
 
 For example, if the rubric questions are:
 [[Rubric]]
-<question>Does the response correctly determine all of the ingredients in the image?
-<question>Does the response correctly state the type of cuisine seen in the image?
-<question>Does the response correctly judge whether or not the displayed cuisine is vegetarian?
-<question>Does the response correctly display the above three properties as a properly formatted JSON list?
+<question>Determine all of the ingredients seen in the image.
+<question>Determine the type of cuisine seen in the image.
+<question>Determine if the displayed cuisine is vegetarian or not
+<question>Display the above three properties as a JSON list
+
 
 Then you will score Response A as:
 
 [[Response A Answers:]]
 <question>
-Question: Does the response correctly determine all of the ingredients in the image?
+Question: Does the Response A correctly determine all of the ingredients in the image?
 Verdict: [[YES]]
 </question>
 <question>
-Question: Does the response correctly state the type of cuisine seen in the image?
+Question: Does the Response A correctly state the type of cuisine seen in the image?
 Verdict: [[NO]]
 </question>
 <question>
-Question: Does the response correctly judge whether or not the displayed cuisine is vegetarian?
+Question: Does the Response A correctly judge whether or not the displayed cuisine is vegetarian?
 Verdict: [[YES]]
 </question>
 <question>
-Question: Does the response correctly display the above three properties as a properly formatted JSON list?
+Question: Does the Response A correctly display the above three properties as a properly formatted JSON list?
 Verdict: [[NO]]
 </question>
 
 [[Rubric Score: 2/4]]
-[[Human Enjoyment Rating: 4 stars]]
 [[Human Rating Reason: This response is accurate and has no grammatical errors but feels too verbose and formal.]]
+[[Human Enjoyment Rating: 4 stars]]
 
 Repeat the above for Response B.