add eval for security_review prompt #34
Conversation
snowmead left a comment
Awesome job! 💯
Did you include things like zombienet, the node, etc., so we could potentially have other types of evals, like having the agent run the node and test transactions?
Yes, I didn't think that far ahead, but I left the template as is because any part of it could become part of a future eval.
Proposal: we could probably one-shot a very simple run+eval framework so we don't need to duplicate any code when creating new evals.
Yes. I think this is better done at the time we create the second eval, so we have good information about what needs to be reused and where it makes more sense to just duplicate.
Adds the first eval so that we have a skeleton to evaluate and improve our flows. These are meant to be run locally for now, with your own credentials for LLM models.
An `eval` has `runs`, which consist of a `task` (something the agent performs using the MCP server) and an `eval`, which performs different evaluations of the task. Results are stored locally in an `.evals` folder under a randomly generated id. Also adds a clean script to delete run results.
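
A minimal sketch of what this run-and-eval flow might look like, assuming a TypeScript harness; the type names (`Task`, `Eval`), functions (`run`, `clean`), and file names are illustrative, not the PR's actual API:

```typescript
// Hypothetical sketch of the run/eval flow described above; the actual
// module layout and signatures in this PR may differ.
import { mkdir, writeFile, rm } from "node:fs/promises";
import { randomUUID } from "node:crypto";
import * as path from "node:path";

const EVALS_DIR = ".evals";

// A "task" is something the agent performs against the MCP server;
// modeled here as an async function returning a transcript.
type Task = () => Promise<string>;

// An "eval" scores the task's output in one or more ways.
type Eval = (transcript: string) => Promise<Record<string, number>>;

// One run: execute the task, evaluate it, and persist the results
// under .evals/<random-id>/, as described in the PR.
async function run(task: Task, evaluate: Eval): Promise<string> {
  const id = randomUUID();
  const dir = path.join(EVALS_DIR, id);
  await mkdir(dir, { recursive: true });

  const transcript = await task();
  await writeFile(path.join(dir, "transcript.txt"), transcript);

  const scores = await evaluate(transcript);
  await writeFile(path.join(dir, "scores.json"), JSON.stringify(scores, null, 2));

  return id;
}

// Clean script: delete all locally stored run results.
async function clean(): Promise<void> {
  await rm(EVALS_DIR, { recursive: true, force: true });
}
```

Keeping each run under its own randomly generated id makes results effectively append-only, so the clean script can simply remove the whole `.evals` directory.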