Roadmap for v0.4.0 #35

zimmski · 2024-04-17T07:18:15Z

The v0.4.0 release is mainly meant to introduce Java to the benchmark. There are two main goals:

Do the same evaluation we did for v0.3.0 with plain.go, but this time with plain.java.
Automate the interpretation of the evaluation results as much as possible so we can iterate on new releases faster (every release needs a blog post about the results).

Tasks:

bauersimon · 2024-04-25T14:50:43Z

Scoring, Categorization, Bar Charts split by language.

zimmski · 2024-04-25T15:15:00Z

Check the determinism of models, e.g. execute each plain repository X times and then check whether the results are stable.
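
A minimal sketch of such a stability check, assuming each repetition yields one score per model. The function name `unstable_models`, the input shape and the model names are hypothetical and not taken from the benchmark code:

```python
from collections import defaultdict


def unstable_models(runs):
    """Return the models whose score differs between repeated runs.

    `runs` is a list with one {model_id: score} dict per repetition
    (a hypothetical shape chosen for this sketch).
    """
    scores_per_model = defaultdict(set)
    for run in runs:
        for model, score in run.items():
            scores_per_model[model].add(score)
    # A model counts as stable only if every run produced the same score.
    return sorted(m for m, scores in scores_per_model.items() if len(scores) > 1)


# Placeholder scores for three repetitions of the same repository.
runs = [
    {"model-a": 14, "model-b": 9},
    {"model-a": 14, "model-b": 11},
    {"model-a": 14, "model-b": 9},
]
print(unstable_models(runs))
```

Comparing exact scores is the strictest possible definition of "stable"; a tolerance could be used instead if small deviations are acceptable.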

zimmski · 2024-04-25T15:42:02Z

Save the descriptions of the models as well: https://openrouter.ai/api/v1/models

The reason is that these can change over time, and we need to know after a while what they were. E.g. right now I would like to know whether mistral-7b-instruct was v0.1 in the last evaluation or not.
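
One way this could look, as a rough sketch: fetch the model list and store a timestamped snapshot next to the evaluation results. It assumes the endpoint returns JSON with a top-level `data` list whose entries carry `id`, `name` and `description` fields; the function names and the file naming scheme are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch_models(url="https://openrouter.ai/api/v1/models"):
    """Download the model list (assumed to be a JSON object with a "data" list)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def snapshot_models(payload, directory, now=None):
    """Write the model descriptions from `payload` into a timestamped JSON file."""
    now = now or datetime.now(timezone.utc)
    models = {
        entry["id"]: {
            "name": entry.get("name"),
            "description": entry.get("description"),
        }
        for entry in payload.get("data", [])
    }
    # Hypothetical naming scheme: one file per snapshot, sortable by time.
    path = Path(directory) / ("models-" + now.strftime("%Y-%m-%dT%H%M%SZ") + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    snapshot = {"fetched_at": now.isoformat(), "models": models}
    path.write_text(json.dumps(snapshot, indent=2, sort_keys=True))
    return path
```

Calling `snapshot_models(fetch_models(), "model-snapshots")` once per evaluation run would then leave a record of what each model ID meant at that time.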

zimmski · 2024-04-25T16:07:21Z

Order models by open-weight, allows commercial use, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2, so commercial use is allowed. It should be rated better than GPT-4.
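
A possible sort key for this, as a sketch only. The precedence used here (open weights first, then commercial use, then lower price, then smaller size) is an assumption, as are the `Model` fields, the units and the placeholder entries:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Model:
    name: str
    open_weight: bool
    commercial_use: bool
    price_per_million_tokens: float  # hypothetical unit
    parameters_billion: Optional[float] = None  # None if the size is unknown


def ranking_key(model):
    """Sort key: open weights, then commercial use, then price, then size."""
    size = model.parameters_billion
    if size is None:
        size = float("inf")  # unknown sizes sort last within a tie
    # False sorts before True, so the "good" properties are negated.
    return (not model.open_weight, not model.commercial_use,
            model.price_per_million_tokens, size)


# Placeholder entries with made-up values, purely for illustration.
models = [
    Model("closed-api", False, True, 10.0),
    Model("open-research-only", True, False, 0.2, 7),
    Model("open-permissive", True, True, 0.2, 7),
]
for model in sorted(models, key=ranking_key):
    print(model.name)
```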

zimmski · 2024-04-25T19:57:28Z

Write down a playbook for evaluations.

E.g. one thing that should happen is that we let the benchmark run 5 times and then sum up the points, but the runs should have at least a one-hour break in between so we do not run into cached responses.
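
That playbook step could be sketched as follows. `run_once` is a hypothetical callable that performs one benchmark run and returns `{model_id: points}`; the `sleep` parameter exists only so the pause can be replaced when trying this out:

```python
import time


def run_benchmark_repeatedly(run_once, runs=5, pause_seconds=3600, sleep=time.sleep):
    """Run the benchmark `runs` times and sum up the points per model."""
    totals = {}
    for i in range(runs):
        if i > 0:
            # Wait between runs so that responses are less likely to be cached.
            sleep(pause_seconds)
        for model, points in run_once().items():
            totals[model] = totals.get(model, 0) + points
    return totals
```

With the defaults, five runs take a bit more than four hours because the pause is only inserted between runs, not before the first one.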

zimmski · 2024-04-26T11:09:09Z

Bar charts should have their values on the bars. The axis values do not work that well.

zimmski · 2024-04-26T11:40:40Z

Pick an example or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through the results manually.

zimmski · 2024-04-26T12:33:45Z

Do test file paths through:

symflower symbols
Task for models

zimmski · 2024-04-26T13:13:53Z

Added all follow-ups to #79, so this issue is officially closed for changes. We only do the last tasks and then close it.

CC @bauersimon

zimmski · 2024-05-02T22:50:36Z

Finally. Eating that cake!

zimmski added the enhancement New feature or request label Apr 17, 2024

zimmski self-assigned this Apr 17, 2024

zimmski added this to the v0.4.0 milestone Apr 17, 2024

zimmski closed this as completed May 2, 2024

zimmski added roadmap Collection of issues for a release and removed enhancement New feature or request labels Jun 17, 2024
