Notes from deeplearning.ai Evaluating AI Agents course #1


Notes from deeplearning.ai Evaluating AI Agents course by Arize

March 2025

https://learn.deeplearning.ai/courses/evaluating-ai-agents/

https://github.com/Arize-ai/phoenix

Introduction

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/sqkza/introduction

Evaluation in the Time of LLMs

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/gz8ys/evaluation-in-the-time-of-llms

Decomposing Agents

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/hvoxa/decomposing-agents

Lab 1: Building Your Agent

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/pag5y/lab-1:-building-your-agent

notebook
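
The lab builds the agent that the rest of the course evaluates: a router that uses LLM tool calling to dispatch to skills such as a SQL lookup over sales data. Below is a rough sketch of the router step only, assuming the OpenAI tool-calling API; the tool name, schema, and model are hypothetical stand-ins for the lab's own definitions.

```python
# Rough sketch of the router pattern (hypothetical tool name/schema, not the lab's code).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",  # hypothetical skill
            "description": "Answer a question by querying the sales database.",
            "parameters": {
                "type": "object",
                "properties": {"question": {"type": "string"}},
                "required": ["question"],
            },
        },
    },
]

def run_router(question: str):
    """Ask the model to choose a tool -- the 'router' decision that Lab 3 later evaluates."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    return response.choices[0].message.tool_calls  # None if the model answered directly
```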

Tracing Agents

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/uymu6/tracing-agents

Lab 2: Tracing Your Agent

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/njjlv/lab-2:-tracing-your-agent
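
The lab instruments the agent so every LLM call and tool call shows up as a span in Phoenix. A minimal sketch, assuming a locally running Phoenix instance and the phoenix.otel / openinference packages; the project name is a placeholder.

```python
# Point OpenTelemetry tracing at a local Phoenix instance and auto-instrument OpenAI calls.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

px.launch_app()  # start Phoenix locally; skip if one is already running

tracer_provider = register(project_name="evaluating-agents")  # placeholder project name
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# For non-LLM steps (router logic, skills) spans can be opened manually:
tracer = tracer_provider.get_tracer(__name__)
with tracer.start_as_current_span("router"):
    pass  # run the router here
```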

Adding Router and Skill Evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/x3i1d/adding-router-and-skill-evaluations

Lab 3: Adding Router and Skill Evaluations (Very useful!)

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/yx7uz/lab-3:-adding-router-and-skill-evaluations

The workflow: run the agent on example input data, export the corresponding spans from the collected traces (via code), then run an LLM-as-judge evaluation over those spans.
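
For the span-export step, a minimal sketch assuming Phoenix's span query DSL; the project name and the selected span attributes are placeholders in the spirit of the lab, not its exact code.

```python
# Export the router's tool-call spans from Phoenix into a pandas dataframe.
import phoenix as px
from phoenix.trace.dsl import SpanQuery

query = (
    SpanQuery()
    .where("span_kind == 'LLM'")      # only the LLM spans where the router chose a tool
    .select(
        question="input.value",       # column names chosen to match the eval template below
        tool_call="llm.tools",        # attribute paths depend on your instrumentation
    )
)
tool_calls_df = px.Client().query_spans(query, project_name="evaluating-agents")
```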

TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated tool call.

    [Tool Definitions]: {tool_definitions}
"""
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity 
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly 
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one 
that is vague, disorganized, or difficult to understand, even if it may be factually correct.

Your response should be a single word: either "clear" or "unclear," and it should not include any other 
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and 
appropriately addresses the query. "unclear" indicates that some part of the response could be better 
structured or worded.
Please carefully consider the query and answer before determining your response.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to 
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your 
explanation. Your reasoning should include specific points about how the answer does or does not meet the 
criteria for clarity.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""
SQL_EVAL_GEN_PROMPT = """
SQL Evaluation Prompt:
-----------------------
You are tasked with determining if the SQL generated appropriately answers a given instruction
taking into account its generated query and response.

Data:
-----
- [Instruction]: {question}
  This section contains the specific task or problem that the sql query is intended to solve.

- [Reference Query]: {query_gen}
  This is the sql query submitted for evaluation. Analyze it in the context of the provided
  instruction.

Evaluation:
-----------
Your response should be a single word: either "correct" or "incorrect".
You must assume that the db exists and that columns are appropriately named.
You must take into account the response as additional information to determine the correctness.

- "correct" indicates that the sql query correctly solves the instruction.
- "incorrect" indicates that the sql query correctly does not solve the instruction correctly.

Note: Your response should contain only the word "correct" or "incorrect" with no additional text
or characters.
"""

Adding Trajectory Evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/xdfwr/adding-trajectory-evaluations

Lab 4: Adding Trajectory Evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/an0wh/lab-4:-adding-trajectory-evaluations
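
The trajectory lab looks at the sequence of steps an agent takes rather than any single output. One simple illustration (my own sketch of a convergence-style metric, not necessarily the lab's exact formula): run the agent several times on similar questions and compare the shortest observed path length to the average.

```python
# Convergence-style trajectory metric: how close is the average run to the best run?
# Purely illustrative; the lab's exact metric and data structures may differ.
def convergence_score(run_lengths: list[int]) -> float:
    """run_lengths = number of steps (spans) each agent run took."""
    if not run_lengths:
        return 0.0
    optimal = min(run_lengths)                    # best observed trajectory
    average = sum(run_lengths) / len(run_lengths)
    return optimal / average                      # 1.0 = every run is optimal

# Example: four runs taking 4, 5, 4 and 7 steps.
print(convergence_score([4, 5, 4, 7]))  # 0.8
```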

Adding Structure to Your Evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/a2c54/adding-structure-to-your-evaluations

Improving Your LLM-as-a-Judge

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/f63n9/improving-your-llm-as-a-judge

Monitoring Agents

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/y5v5y/monitoring-agents
