Notes from the deeplearning.ai course "Evaluating AI Agents" by Arize
March 2025
https://learn.deeplearning.ai/courses/evaluating-ai-agents/
https://github.com/Arize-ai/phoenix
Introduction
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/sqkza/introduction
Evaluation-in-the-time-of-llms
Decomposing-agents
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/hvoxa/decomposing-agents
Lab 1: building-your-agent
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/pag5y/lab-1:-building-your-agent
notebook
Tracing-agents
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/uymu6/tracing-agents
Lab 2: Tracing your agent
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/njjlv/lab-2:-tracing-your-agent
Adding-router-and-skill-evaluations
Lab 3: adding-router-and-skill-evaluations (Very useful!)
The workflow is: run the agent on example input data, export the corresponding spans from the traces (via code), then run LLM-as-a-judge evaluations on those spans.
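A minimal sketch of that workflow in plain Python, to make the three steps concrete. It deliberately makes no Phoenix calls: the span records, the tool name `lookup_sales_data`, and `stub_judge` are all hypothetical stand-ins (a real run would export spans from the tracing backend and call an LLM as the judge).

```python
# Hypothetical span records standing in for an export from the tracing backend.
spans = [
    {"span_kind": "LLM", "question": "Which store had the most sales?", "tool_call": "lookup_sales_data"},
    {"span_kind": "TOOL", "question": "Which store had the most sales?", "tool_call": "lookup_sales_data"},
    {"span_kind": "LLM", "question": "Plot sales by month", "tool_call": "lookup_sales_data"},
]


def select_router_spans(all_spans):
    """Keep only LLM spans that recorded a tool choice (the router decisions)."""
    return [s for s in all_spans if s["span_kind"] == "LLM" and s.get("tool_call")]


def stub_judge(question, tool_call):
    """Stand-in for the LLM judge (assumption): a toy rule instead of a model call."""
    wants_plot = "plot" in question.lower()
    chose_plot_tool = "visualization" in tool_call
    return "correct" if wants_plot == chose_plot_tool else "incorrect"


def evaluate(all_spans, judge):
    """Attach a judge label to every router span."""
    results = []
    for span in select_router_spans(all_spans):
        label = judge(span["question"], span["tool_call"])
        results.append({**span, "label": label})
    return results


results = evaluate(spans, stub_judge)
for r in results:
    print(r["question"], "->", r["label"])
```

Swapping `stub_judge` for a function that fills one of the prompt templates below and calls a model keeps the rest of the loop unchanged.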
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.
[BEGIN DATA]
************
[Question]: {question}
************
[Tool Called]: {tool_call}
[END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.
"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.
[Tool Definitions]: {tool_definitions}
"""
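Since the tool-calling prompt asks for exactly one word, the judge's raw output still has to be checked against the allowed labels. A possible sketch follows; `SHORT_TEMPLATE`, `snap_to_rails`, and the `NOT_PARSABLE` sentinel are names made up for this example, and the template is a shortened stand-in with the same three placeholders as the full prompt above.

```python
# Shortened stand-in (assumption) with the same placeholders as the full prompt.
SHORT_TEMPLATE = (
    "[Question]: {question}\n"
    "[Tool Called]: {tool_call}\n"
    "[Tool Definitions]: {tool_definitions}\n"
)

RAILS = ("correct", "incorrect")
NOT_PARSABLE = "NOT_PARSABLE"  # hypothetical sentinel for off-format answers


def snap_to_rails(raw, rails=RAILS):
    """Map raw judge output to one of the allowed labels, or to NOT_PARSABLE."""
    cleaned = raw.strip().strip("\"'.").lower()
    # Exact match on purpose: "incorrect" contains "correct" as a substring.
    return cleaned if cleaned in rails else NOT_PARSABLE


prompt = SHORT_TEMPLATE.format(
    question="Which store had the most sales?",
    tool_call="lookup_sales_data(prompt='most sales by store')",  # hypothetical tool call
    tool_definitions="[...]",
)

print(snap_to_rails(' "Correct". '))
print(snap_to_rails("The tool is correct"))
```

Counting how often the sentinel appears is a cheap way to notice that the judge is ignoring the output format.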
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one
that is vague, disorganized, or difficult to understand, even if it may be factually correct.
Your response should be a single word: either "clear" or "unclear," and it should not include any other
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and
appropriately addresses the query. "unclear" indicates that some part of the response could be better
structured or worded.
Please carefully consider the query and answer before determining your response.
After analyzing the query and the answer, you must write a detailed explanation of your reasoning to
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your
explanation. Your reasoning should include specific points about how the answer does or does not meet the
criteria for clarity.
[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.
EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""
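The clarity prompt asks for an EXPLANATION followed by a LABEL, so the response needs a little parsing before it can be logged. One way to do that with a regular expression is sketched here; `parse_judgement` and the sample response are illustrative, not taken from the course.

```python
import re


def parse_judgement(text):
    """Split a judge response into its explanation and its clear/unclear label."""
    # "unclear" is listed first so the longer alternative is tried before "clear".
    label = re.search(r'LABEL:\s*"?(unclear|clear)\b', text, re.IGNORECASE)
    explanation = re.search(
        r"EXPLANATION:\s*(.*?)\s*LABEL:", text, re.IGNORECASE | re.DOTALL
    )
    return {
        "label": label.group(1).lower() if label else None,
        "explanation": explanation.group(1) if explanation else None,
    }


# Made-up response in the format the prompt requests.
sample = (
    "EXPLANATION: The answer is short and directly addresses the query.\n"
    'LABEL: "clear"'
)
parsed = parse_judgement(sample)
print(parsed["label"])
print(parsed["explanation"])
```

Returning `None` for a missing field, rather than raising, lets a batch run continue and leaves malformed responses easy to filter afterwards.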
SQL_EVAL_GEN_PROMPT = """
SQL Evaluation Prompt:
-----------------------
You are tasked with determining if the SQL generated appropriately answers a given instruction
taking into account its generated query and response.
Data:
-----
- [Instruction]: {question}
This section contains the specific task or problem that the sql query is intended to solve.
- [Reference Query]: {query_gen}
This is the sql query submitted for evaluation. Analyze it in the context of the provided
instruction.
Evaluation:
-----------
Your response should be a single word: either "correct" or "incorrect".
You must assume that the db exists and that columns are appropriately named.
You must take into account the response as additional information to determine the correctness.
- "correct" indicates that the sql query correctly solves the instruction.
- "incorrect" indicates that the sql query does not solve the instruction correctly.
Note: Your response should contain only the word "correct" or "incorrect" with no additional text
or characters.
"""
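Each template expects specific placeholders (`{question}` and `{query_gen}` for the SQL prompt), so the exported span data has to carry fields with those names. A small check along these lines can catch a mismatch before any judge calls are made; `template_fields`, `SHORT_SQL_TEMPLATE`, and the example row are assumptions for illustration.

```python
from string import Formatter


def template_fields(template):
    """Return the set of placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Shortened stand-in (assumption) with the same placeholders as the SQL prompt.
SHORT_SQL_TEMPLATE = (
    "- [Instruction]: {question}\n"
    "- [Reference Query]: {query_gen}\n"
)

# Hypothetical row built from an exported span.
row = {
    "question": "What was the total revenue in November?",
    "query_gen": "SELECT SUM(revenue) FROM sales WHERE month = 11",
}

missing = template_fields(SHORT_SQL_TEMPLATE) - set(row)
if missing:
    raise KeyError(f"row is missing template fields: {sorted(missing)}")

prompt = SHORT_SQL_TEMPLATE.format(**row)
print(prompt)
```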
Adding-trajectory-evaluations
Lab 4: adding-trajectory-evaluations
Adding-structure-to-your-evaluations
Improving LLM-as-a-judge
Monitoring
https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/y5v5y/monitoring-agents