elfedy commented Sep 12, 2025

Adds the first eval so that we have a skeleton to evaluate and improve our flows. For now, evals are meant to be run locally, using your own credentials for the LLM models.

An eval has runs, each of which consists of a task (something the agent performs using the MCP server) and an eval step that performs different evaluations of the task's outcome. Results are stored locally in an .evals folder under a randomly generated id.

Also adds a clean script to delete run results.
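
For illustration only, here is a minimal sketch of that run/eval/clean shape. It is not the code in this PR: the names `run_eval`, `clean_runs`, and `result.json` are hypothetical, the task is a stub standing in for an agent working through the MCP server, and the checks are plain predicates on the task output.

```python
import json
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import Callable


def run_eval(
    task: Callable[[], str],
    checks: dict[str, Callable[[str], bool]],
    root: Path,
) -> Path:
    """Perform one run: execute the task, evaluate its output, store the results.

    Hypothetical helper; the layout below is an assumption, not the PR's layout.
    """
    run_id = uuid.uuid4().hex  # randomly generated id for this run
    run_dir = root / ".evals" / run_id
    run_dir.mkdir(parents=True)

    output = task()  # stand-in for the agent performing the task
    results = {name: check(output) for name, check in checks.items()}

    payload = {"output": output, "checks": results}
    (run_dir / "result.json").write_text(json.dumps(payload, indent=2))
    return run_dir


def clean_runs(root: Path) -> None:
    """Delete all stored run results (the role of the clean script)."""
    shutil.rmtree(root / ".evals", ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        run_dir = run_eval(
            task=lambda: "stub agent output",
            checks={"non_empty": lambda out: bool(out)},
            root=root,
        )
        print("results stored in", run_dir)
        clean_runs(root)
```

Keeping each run in its own directory means results from different runs never overwrite each other, and cleaning up is a single directory removal.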

elfedy requested a review from snowmead September 12, 2025 18:21

snowmead left a comment

Awesome job! 💯

Did you include things like zombienet, node, etc. to potentially have other types of evals like having the agent run the node and test transactions?

Collaborator Author

Yes. I didn't think about it that deeply; I just left the template as is, because any part of it could end up being used in a potential eval.

snowmead commented Sep 12, 2025

proposal: we could probably one shot a very simple run+eval framework so we don't need to duplicate any code when creating new evals

elfedy commented Sep 12, 2025

> proposal: we could probably one shot a very simple run+eval framework so we don't need to duplicate any code when creating new evals

Yes. I think this is better done at the time we create the second eval so we have good info about what needs to be reused and where it makes more sense to just duplicate.

elfedy merged commit 0594c81 into main Sep 15, 2025
3 checks passed
elfedy deleted the elfedy-eval branch September 15, 2025 13:16