add eval for security_review prompt #34
Conversation
snowmead left a comment
Awesome job! 💯
Did you include things like zombienet, the node, etc., so we could potentially have other types of evals, like having the agent run the node and test transactions?
Yes, I didn't think that far ahead, but I left the template as is because any part of it could become part of a future eval.
Proposal: we could probably one-shot a very simple run+eval framework so we don't need to duplicate any code when creating new evals.
Yes. I think this is better done at the time we create the second eval, so we have good information about what needs to be reused and where it makes more sense to just duplicate.
Adds the first eval so that we have a skeleton to evaluate and improve our flows. These are meant to be run locally for now, with your own credentials for LLM models.
An `eval` has `runs`, which consist of a `task` (something the agent performs using the MCP server) and an `eval`, which performs different evaluations of the task. Results are stored locally in an `.evals` folder under a randomly generated id. Also adds a clean script to delete run results.
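
A minimal sketch of what this run-and-eval flow might look like, assuming a TypeScript harness; the type names (`Task`, `Eval`), functions (`run`, `clean`), and file names are illustrative, not the PR's actual API:

```typescript
// Hypothetical sketch of the run/eval flow described above; the actual
// module layout and signatures in this PR may differ.
import { mkdir, writeFile, rm } from "node:fs/promises";
import { randomUUID } from "node:crypto";
import * as path from "node:path";

const EVALS_DIR = ".evals";

// A "task" is something the agent performs against the MCP server;
// modeled here as an async function returning a transcript.
type Task = () => Promise<string>;

// An "eval" scores the task's output in one or more ways.
type Eval = (transcript: string) => Promise<Record<string, number>>;

// One run: execute the task, evaluate it, and persist the results
// under .evals/<random-id>/, as described in the PR.
async function run(task: Task, evaluate: Eval): Promise<string> {
  const id = randomUUID();
  const dir = path.join(EVALS_DIR, id);
  await mkdir(dir, { recursive: true });

  const transcript = await task();
  await writeFile(path.join(dir, "transcript.txt"), transcript);

  const scores = await evaluate(transcript);
  await writeFile(path.join(dir, "scores.json"), JSON.stringify(scores, null, 2));

  return id;
}

// Clean script: delete all locally stored run results.
async function clean(): Promise<void> {
  await rm(EVALS_DIR, { recursive: true, force: true });
}
```

Keeping each run under its own randomly generated id makes results effectively append-only, so the clean script can simply remove the whole `.evals` directory.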