Roadmap for v0.4.0 #35

Closed
30 tasks done
zimmski opened this issue Apr 17, 2024 · 10 comments

@zimmski
Member

zimmski commented Apr 17, 2024

v0.4.0 is mainly meant to introduce Java to the benchmark. There are two main goals:

  1. Simply do the same evaluation we did with 0.3.0 with plain.go, but this time with a plain.java.
  2. Automate the interpretation of the evaluation result as much as possible so we can iterate on new releases faster (every release needs a blog post about the results).

Tasks:

@zimmski zimmski added the enhancement New feature or request label Apr 17, 2024
@zimmski zimmski self-assigned this Apr 17, 2024
@zimmski zimmski added this to the v0.4.0 milestone Apr 17, 2024
@bauersimon
Member

Scoring, Categorization, Bar Charts split by language.

@zimmski
Member Author

zimmski commented Apr 25, 2024

Check the determinism of models, e.g. execute each plain repository X times and then check whether the results are stable.
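
A minimal sketch of such a check, assuming each execution can be reduced to one comparable result (for example a score or a hash of the generated test file). `check_determinism` and `run_once` are hypothetical names for illustration, not part of the benchmark.

```python
from collections import Counter
from typing import Callable, Hashable


def check_determinism(run_once: Callable[[], Hashable], repetitions: int) -> dict:
    """Execute `run_once` several times and report how stable its results are."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")

    # Count how often each distinct result shows up across all executions.
    outcomes = Counter(run_once() for _ in range(repetitions))
    _, occurrences = outcomes.most_common(1)[0]

    return {
        # Stable means every execution produced exactly the same result.
        "stable": len(outcomes) == 1,
        "distinct_results": len(outcomes),
        # Share of executions that agree with the most frequent result.
        "agreement": occurrences / repetitions,
    }
```

The `agreement` value gives a softer signal than the boolean: a model that returns the same result in 4 of 5 executions is not stable, but it is still more predictable than one that differs every time.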

@zimmski
Member Author

zimmski commented Apr 25, 2024

Save the descriptions of the models as well: https://openrouter.ai/api/v1/models

The reason is that these can change over time, and we need to know after a while what they were. E.g. right now I would like to know whether mistral-7b-instruct for the last evaluation was v0.1 or not.
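
One possible way to archive this next to an evaluation, as a sketch only. It assumes the endpoint returns JSON with a top-level `data` list whose entries carry `id`, `name` and `description`; those field names are an assumption and should be checked against the real response. `snapshot_models` and `fetch_payload` are hypothetical helpers.

```python
import json
import urllib.request
from datetime import datetime, timezone

MODELS_URL = "https://openrouter.ai/api/v1/models"


def fetch_payload(url: str = MODELS_URL) -> dict:
    """Download the current model listing (performs a network request)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def snapshot_models(payload: dict, fetched_at: datetime) -> dict:
    """Reduce a model listing to the fields worth archiving with an evaluation."""
    models = {}
    # Assumed response shape: {"data": [{"id": ..., "name": ..., "description": ...}]}
    for entry in payload.get("data", []):
        models[entry["id"]] = {
            "name": entry.get("name"),
            "description": entry.get("description"),
        }
    # The timestamp records when the descriptions were valid.
    return {"source": MODELS_URL, "fetched_at": fetched_at.isoformat(), "models": models}
```

Calling `snapshot_models(fetch_payload(), datetime.now(timezone.utc))` and writing the result with `json.dump` into the evaluation's result directory would keep the descriptions together with the scores they belong to.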

@zimmski
Member Author

zimmski commented Apr 25, 2024

Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0 licensed, so commercial use is allowed. It should be rated better than GPT-4.
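
A sketch of how such an ordering could look as a sort key. The precedence (open-weight first, then commercial use, then price, then size) is one reading of the list above, and the `Model` type, its field names and the prices in the example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    name: str
    open_weight: bool
    commercial_use: bool
    price: float  # hypothetical cost unit, lower is better
    parameters_billion: Optional[float]  # None when the size is not published


def ranking_key(model: Model) -> tuple:
    """Sort key: open-weight first, then commercial use, then cheaper, then smaller."""
    size = model.parameters_billion if model.parameters_billion is not None else float("inf")
    # False sorts before True, so the boolean criteria are negated.
    return (not model.open_weight, not model.commercial_use, model.price, size)


# Placeholder prices, not real data.
models = [
    Model("gpt-4", open_weight=False, commercial_use=True, price=30.0, parameters_billion=None),
    Model("mistral-7b-instruct-v0.1", open_weight=True, commercial_use=True, price=0.1, parameters_billion=7.0),
]
ranked = sorted(models, key=ranking_key)
```

Using a tuple as the key keeps the precedence explicit: a later criterion only matters when all earlier ones are equal.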

@zimmski
Member Author

zimmski commented Apr 25, 2024

Write down a playbook for evaluations

  • e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up the points, but the runs should have at least a one-hour break in between to not run into cached responses.
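
The rule above could be sketched like this, assuming one benchmark run can be wrapped in a function that returns its points. `run_benchmark_repeatedly` and `run_once` are hypothetical names; the `sleep` parameter only exists so the waiting can be replaced when trying the function out.

```python
import time
from typing import Callable

MINIMUM_BREAK_SECONDS = 60 * 60  # at least one hour between runs


def run_benchmark_repeatedly(
    run_once: Callable[[], int],
    runs: int = 5,
    break_seconds: int = MINIMUM_BREAK_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the benchmark several times with a break in between and sum the points."""
    total = 0
    for index in range(runs):
        # Wait before every run except the first to avoid cached responses.
        if index > 0:
            sleep(break_seconds)
        total += run_once()
    return total
```

With the defaults, 5 runs mean 4 breaks, so a full evaluation takes at least four hours plus the run time itself.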

@zimmski
Member Author

zimmski commented Apr 26, 2024

Bar charts should have their value on the bar. The axis values do not work that well.

@zimmski
Member Author

zimmski commented Apr 26, 2024

Pick an example or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through the results manually.

@zimmski
Member Author

zimmski commented Apr 26, 2024

Do test file paths through

  • symflower symbols
  • Task for models

@zimmski
Member Author

zimmski commented Apr 26, 2024

Added all follow-ups to #79, so this issue is officially closed for changes. We only do the last tasks and then close it.

CC @bauersimon

@zimmski
Member Author

zimmski commented May 2, 2024

Finally. Eating that cake!

@zimmski zimmski closed this as completed May 2, 2024
@zimmski zimmski added roadmap Collection of issues for a release and removed enhancement New feature or request labels Jun 17, 2024
2 participants