Notes from deeplearning.ai Evaluating AI Agents course

# Notes from deeplearning.ai Evaluating AI Agents course by Arize

March 2025

https://learn.deeplearning.ai/courses/evaluating-ai-agents/

https://github.com/Arize-ai/phoenix

## Introduction

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/sqkza/introduction

![Image](https://github.com/user-attachments/assets/902db1a2-f9a1-4dda-8402-bd0f1a226344)

![Image](https://github.com/user-attachments/assets/8429095e-035d-4442-b25d-4451758883b7)

## Evaluation in the time of LLMs

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/gz8ys/evaluation-in-the-time-of-llms

![Image](https://github.com/user-attachments/assets/61ac082f-c921-412f-95d8-9afa8c60bc50)

![Image](https://github.com/user-attachments/assets/3e228a8d-3b2c-4c11-8590-7ff842ba1210)

![Image](https://github.com/user-attachments/assets/54832983-8a40-4128-a8ae-282e040f988d)

![Image](https://github.com/user-attachments/assets/9bffabd5-abb5-45e9-8eca-b0927f91f61a)

![Image](https://github.com/user-attachments/assets/a85f10ef-f973-4fea-ad15-b2a139f76775)

![Image](https://github.com/user-attachments/assets/95741023-548b-4b6d-81ef-3ea56fd5a04f)

![Image](https://github.com/user-attachments/assets/8316deca-0b3d-402b-8f35-7a6074b00282)

![Image](https://github.com/user-attachments/assets/900ade67-3d18-4341-bf2b-b1fcf0923d88)

## Decomposing agents

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/hvoxa/decomposing-agents

![Image](https://github.com/user-attachments/assets/c748ab33-cc03-4d97-bbc5-3d859b6c7300)

![Image](https://github.com/user-attachments/assets/3f21ef61-2c57-4c60-82b5-53ba9b47191a)

![Image](https://github.com/user-attachments/assets/3ef8bd16-5ba1-4971-a6a5-efb7dde933d9)

![Image](https://github.com/user-attachments/assets/6ccdf30a-e927-4a51-ba95-6bbc2127f955)

![Image](https://github.com/user-attachments/assets/6484979c-e547-4323-bc77-8ddad58a992b)

![Image](https://github.com/user-attachments/assets/4aaf58fa-9a66-4ed0-b712-77604aed008f)

## Lab 1: Building your agent

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/pag5y/lab-1:-building-your-agent

notebook

## Tracing agents

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/uymu6/tracing-agents

![Image](https://github.com/user-attachments/assets/2539accb-0ac0-494a-862a-fb3faad9b8a0)

![Image](https://github.com/user-attachments/assets/cb88877b-3cd8-4aec-8ac1-bfe498998c0a)

![Image](https://github.com/user-attachments/assets/57c15e35-9b50-4a0d-bc5c-b95562a6f540)

## Lab 2: Tracing your agent

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/njjlv/lab-2:-tracing-your-agent

![Image](https://github.com/user-attachments/assets/a921add0-492c-42f0-a7e5-2c32ff3e02bf)

![Image](https://github.com/user-attachments/assets/dfa1420d-e194-4694-86f2-fe1f06d4fe30)

## Adding router and skill evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/x3i1d/adding-router-and-skill-evaluations

![Image](https://github.com/user-attachments/assets/9f189cba-ecb3-4652-8762-c3bcb3a85039)

![Image](https://github.com/user-attachments/assets/8b0656a2-20f9-4692-b15e-728b284477b1)

![Image](https://github.com/user-attachments/assets/50b410ee-89f8-4ebd-bf2c-b5019feaed68)

![Image](https://github.com/user-attachments/assets/729e39c6-1f97-4575-b8d3-ef300cb4c361)

![Image](https://github.com/user-attachments/assets/ed783175-02bd-4c26-a556-65dd9a5a2d63)

![Image](https://github.com/user-attachments/assets/7c8571d9-e003-4a1d-9bc4-fbd0e5d18bb6)

![Image](https://github.com/user-attachments/assets/420f1b59-dcfa-4a14-9ef2-2ee446d17062)

![Image](https://github.com/user-attachments/assets/df89d6a1-4c2f-457a-8fdc-f66745063624)

![Image](https://github.com/user-attachments/assets/3cb87696-bf71-43f0-a284-5f695c52e96a)

## Lab 3: Adding router and skill evaluations (Very useful!)

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/yx7uz/lab-3:-adding-router-and-skill-evaluations

The workflow is to run the agent on example input data, export the corresponding spans from the traces (via code), then run LLM-as-a-judge evaluations on those spans.
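A minimal sketch of the judging step in that workflow, assuming the spans have already been exported as plain dictionaries. The names `judge_spans`, `EXAMPLE_TEMPLATE`, `exported_spans`, and `stub_judge` are hypothetical, and the stub stands in for a real LLM call; this is not the lab's actual code.

```python
from typing import Callable, Dict, List

NOT_PARSABLE = "NOT_PARSABLE"


def judge_spans(
    spans: List[Dict[str, str]],
    template: str,
    judge: Callable[[str], str],
    rails: List[str],
) -> List[str]:
    """Fill the template per span, ask the judge, and keep only allowed labels."""
    labels = []
    for span in spans:
        prompt = template.format(**span)
        reply = judge(prompt).strip().lower()
        # Anything outside the allowed labels ("rails") is flagged, not guessed
        labels.append(reply if reply in rails else NOT_PARSABLE)
    return labels


# Hypothetical short template with the same placeholders as the exported spans
EXAMPLE_TEMPLATE = (
    "Question: {question}\n"
    "Tool called: {tool_call}\n"
    "Answer with correct or incorrect."
)

# Hypothetical spans, as if exported from the traces of an agent run
exported_spans = [
    {"question": "What were total sales in November?", "tool_call": "lookup_sales_data"},
    {"question": "Plot sales by store.", "tool_call": "lookup_sales_data"},
]


def stub_judge(prompt: str) -> str:
    # Stand-in for an LLM call: rejects the lookup tool for plotting requests
    return "incorrect" if "Plot" in prompt else "Correct "


labels = judge_spans(exported_spans, EXAMPLE_TEMPLATE, stub_judge, ["correct", "incorrect"])
# Fraction of spans the judge labeled "correct"
score = sum(label == "correct" for label in labels) / len(labels)
```

Swapping `stub_judge` for a function that calls a real model leaves the rest of the loop unchanged, which is one reason to keep the judge behind a plain callable.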

![Image](https://github.com/user-attachments/assets/8fcb8b6e-43df-4500-b397-5ee70847c71b)

![Image](https://github.com/user-attachments/assets/2cc726f0-9816-4c4b-9361-b4bd0e9092db)

![Image](https://github.com/user-attachments/assets/5e5db879-91b6-49ec-944a-d97592a8c611)

```python
TOOL_CALLING_PROMPT_TEMPLATE = """
You are an evaluation assistant evaluating questions and tool calls to
determine whether the tool called would answer the question. The tool
calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Tool Called]: {tool_call}
    [END DATA]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"incorrect" means that the chosen tool would not answer the question,
the tool includes information that is not presented in the question,
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the question, the tool call generated is runnable and correct,
and that no outside information not present in the question was used
in the generated question.

    [Tool Definitions]: {tool_definitions}
"""
```
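The template above has three placeholders: `{question}`, `{tool_call}`, and `{tool_definitions}`. The sketch below shows one way they might be filled, assuming the tool call and tool definitions are available as Python dictionaries and are serialized to JSON. The helper `build_tool_eval_prompt`, the stand-in template, and the tool name are hypothetical.

```python
import json
from typing import Any, Dict, List


def build_tool_eval_prompt(
    template: str,
    question: str,
    tool_call: Dict[str, Any],
    tools: List[Dict[str, Any]],
) -> str:
    """Fill a tool-calling eval template with JSON-serialized inputs."""
    # Braces inside the substituted JSON are not re-parsed by str.format,
    # so only the placeholders in the template itself are replaced.
    return template.format(
        question=question,
        tool_call=json.dumps(tool_call),
        tool_definitions=json.dumps(tools, indent=2),
    )


# Stand-in template with the same three placeholders as the full prompt
STAND_IN_TEMPLATE = (
    "[Question]: {question}\n"
    "[Tool Called]: {tool_call}\n"
    "[Tool Definitions]: {tool_definitions}"
)

# Hypothetical tool definition and tool call
tools = [
    {
        "name": "lookup_sales_data",
        "parameters": {"prompt": {"type": "string"}},
    }
]
tool_call = {"name": "lookup_sales_data", "arguments": {"prompt": "November sales"}}

prompt = build_tool_eval_prompt(
    STAND_IN_TEMPLATE, "What were total sales in November?", tool_call, tools
)
```

Passing the full `TOOL_CALLING_PROMPT_TEMPLATE` in place of the stand-in should work the same way, since it uses the same placeholder names.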

```python
CLARITY_LLM_JUDGE_PROMPT = """
In this task, you will be presented with a query and an answer. Your objective is to evaluate the clarity 
of the answer in addressing the query. A clear response is one that is precise, coherent, and directly 
addresses the query without introducing unnecessary complexity or ambiguity. An unclear response is one 
that is vague, disorganized, or difficult to understand, even if it may be factually correct.

Your response should be a single word: either "clear" or "unclear," and it should not include any other 
text or characters. "clear" indicates that the answer is well-structured, easy to understand, and 
appropriately addresses the query. "unclear" indicates that some part of the response could be better 
structured or worded.
Please carefully consider the query and answer before determining your response.

After analyzing the query and the answer, you must write a detailed explanation of your reasoning to 
justify why you chose either "clear" or "unclear." Avoid stating the final label at the beginning of your 
explanation. Your reasoning should include specific points about how the answer does or does not meet the 
criteria for clarity.

[BEGIN DATA]
Query: {query}
Answer: {response}
[END DATA]
Please analyze the data carefully and provide an explanation followed by your response.

EXPLANATION: Provide your reasoning step by step, evaluating the clarity of the answer based on the query.
LABEL: "clear" or "unclear"
"""
```
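Because this prompt asks for an `EXPLANATION:` followed by a `LABEL:`, the judge's reply has to be parsed before it can be scored. The following is a rough sketch of such a parser, assuming the reply follows that two-part layout; `parse_judge_output` and `RAILS` are hypothetical names, not part of any library.

```python
import re
from typing import Optional, Tuple

RAILS = ("clear", "unclear")


def parse_judge_output(text: str) -> Tuple[Optional[str], str]:
    """Split a judge reply into (label, explanation); label is None if invalid."""
    # The label may or may not be wrapped in double quotes
    label_match = re.search(r'LABEL:\s*"?(\w+)"?', text)
    label = label_match.group(1).lower() if label_match else None
    if label not in RAILS:
        label = None

    # The explanation runs from its marker up to the LABEL marker or end of text
    expl_match = re.search(r"EXPLANATION:\s*(.*?)(?=LABEL:|\Z)", text, flags=re.DOTALL)
    explanation = expl_match.group(1).strip() if expl_match else ""
    return label, explanation


reply = 'EXPLANATION: The answer is short and direct.\nLABEL: "clear"'
label, explanation = parse_judge_output(reply)
```

Returning `None` for anything outside the two allowed labels keeps malformed replies visible instead of silently counting them as "unclear".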

```python
SQL_EVAL_GEN_PROMPT = """
SQL Evaluation Prompt:
-----------------------
You are tasked with determining if the SQL generated appropriately answers a given instruction
taking into account its generated query and response.

Data:
-----
- [Instruction]: {question}
  This section contains the specific task or problem that the sql query is intended to solve.

- [Reference Query]: {query_gen}
  This is the sql query submitted for evaluation. Analyze it in the context of the provided
  instruction.

Evaluation:
-----------
Your response should be a single word: either "correct" or "incorrect".
You must assume that the db exists and that columns are appropriately named.
You must take into account the response as additional information to determine the correctness.

- "correct" indicates that the sql query correctly solves the instruction.
- "incorrect" indicates that the sql query does not solve the instruction correctly.

Note: Your response should contain only the word "correct" or "incorrect" with no additional text
or characters.
"""
```

![Image](https://github.com/user-attachments/assets/065c025d-2e7c-43d9-9743-47ad97be6bbc)

## Adding trajectory evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/xdfwr/adding-trajectory-evaluations

![Image](https://github.com/user-attachments/assets/bb6049ef-1ac4-4cd1-ad7f-d2e38d54b99b)

![Image](https://github.com/user-attachments/assets/6994bc92-38fb-4426-bfdf-98e6fc46a337)

![Image](https://github.com/user-attachments/assets/d4d031d5-8707-454f-b15d-ce3b280e18c5)

![Image](https://github.com/user-attachments/assets/72d43b55-d7d2-4ed6-9c17-20b8895d956e)

![Image](https://github.com/user-attachments/assets/41afa444-4fe5-4e15-8eba-69b223a23245)

![Image](https://github.com/user-attachments/assets/d79118ad-d4c5-43f7-b5ba-33d3de65dc30)

## Lab 4: Adding trajectory evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/an0wh/lab-4:-adding-trajectory-evaluations

![Image](https://github.com/user-attachments/assets/555ad0bb-5a23-4642-ab1c-af76a09ac1b4)

![Image](https://github.com/user-attachments/assets/4db83cda-8280-45aa-8dd8-4513933b7475)

## Adding structure to your evaluations

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/a2c54/adding-structure-to-your-evaluations

![Image](https://github.com/user-attachments/assets/24dbc6d2-3ce7-45f5-870c-51f05a034b35)

![Image](https://github.com/user-attachments/assets/f7a7180f-432e-496b-8787-909d72bdbf1c)

![Image](https://github.com/user-attachments/assets/b783f7da-2e0c-4ec9-85af-ad36cfd3c98e)

![Image](https://github.com/user-attachments/assets/a0f5d75e-8a81-4d93-ad67-e848105dd03d)

![Image](https://github.com/user-attachments/assets/2c9f7523-d9d8-474c-87f1-33df248acc00)

![Image](https://github.com/user-attachments/assets/ddbcbdf1-6aaa-4d66-a8e2-a92af4a9d3bd)

![Image](https://github.com/user-attachments/assets/1fdf3146-6aee-462d-937c-b041a5dda678)

## Improving LLM-as-a-judge

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/f63n9/improving-your-llm-as-a-judge

![Image](https://github.com/user-attachments/assets/79e74a30-3740-45b5-854c-e79c7bead336)

![Image](https://github.com/user-attachments/assets/ab90414f-42d2-4dd2-bacf-64dc426e76cd)

## Monitoring

https://learn.deeplearning.ai/courses/evaluating-ai-agents/lesson/y5v5y/monitoring-agents

![Image](https://github.com/user-attachments/assets/7470d00e-8305-46cd-b70f-a8cafaa255fd)

![Image](https://github.com/user-attachments/assets/673fd2e0-06f2-4818-b38e-82c30f82b078)

![Image](https://github.com/user-attachments/assets/e2a02d8a-e765-443a-a630-e5545cface34)

![Image](https://github.com/user-attachments/assets/127ef82e-4732-48b1-a061-19d371665a3f)

![Image](https://github.com/user-attachments/assets/b1126f0e-1383-4e86-8b2c-05ae351e85d4)









