Releases: instructlab/eval

Leaderboard v0.6.0

16 Apr 05:42
cea8acd

This release of the instructlab/eval library adds support for the Leaderboard v2 benchmark.

To use the new leaderboard evaluator, install it with pip install instructlab-eval[leaderboard] and then import LeaderboardV2Evaluator from instructlab.eval.leaderboard:

from instructlab.eval.leaderboard import LeaderboardV2Evaluator

# Evaluate the model locally across 8 GPUs
model_path = "meta-llama/Llama-3.1-8B-Instruct"
evaluator = LeaderboardV2Evaluator(model_path=model_path, num_gpus=8)
result = evaluator.run()
print(f"Results for {model_path}: {result['overall_score']}")
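
If the result is consumed by a script rather than printed directly, a small guard around the lookup can help. The following is a minimal sketch: the helper name format_overall_score is hypothetical, and it assumes that the value under 'overall_score' is numeric. These notes show only that one key, so nothing else in the result is relied on.

```python
from typing import Any, Mapping


def format_overall_score(model_path: str, result: Mapping[str, Any]) -> str:
    """Build a one-line summary from an evaluator result mapping."""
    # 'overall_score' is the only key shown in the release notes.
    score = result.get("overall_score")
    if score is None:
        return f"Results for {model_path}: overall_score missing"
    # Assumption: the score is numeric, so it can be formatted as a float.
    return f"Results for {model_path}: {float(score):.4f}"


# Stand-in value for illustration only, not real benchmark output.
example = {"overall_score": 0.5}
print(format_overall_score("meta-llama/Llama-3.1-8B-Instruct", example))
```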

This new evaluator supports running in one of two ways:

  • Running locally: the evaluator optimizes the run by splitting tasks between vLLM and HF Transformers.
  • Running remotely: you provide an OpenAI client, and the evaluator simply makes its calls through it.
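
The idea behind the local split can be pictured as a partition of task names into two backend buckets. This is only an illustrative sketch, not the library's implementation: the function split_tasks, the task names, and the rule that generation-style tasks go to vLLM are all assumptions, since these notes do not say how tasks are routed.

```python
from typing import Iterable


def split_tasks(tasks: Iterable[str], generative: set[str]) -> dict[str, list[str]]:
    """Partition task names into two backend buckets, preserving input order."""
    plan: dict[str, list[str]] = {"vllm": [], "hf": []}
    for task in tasks:
        # Assumed rule: tasks marked generative go to vLLM, the rest to HF Transformers.
        backend = "vllm" if task in generative else "hf"
        plan[backend].append(task)
    return plan


# Hypothetical task names; the real task list and routing are not given in these notes.
plan = split_tasks(["task_a", "task_b", "task_c"], generative={"task_b"})
print(plan)
```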

What's Changed

Here's a comprehensive outline of all the changes made:

  • ci: Add OpenAI keys into CI by @alimaredia in #221
  • build(deps): bump sarisia/actions-status-discord from 1.15.1 to 1.15.3 by @dependabot in #220
  • build(deps): bump hynek/build-and-inspect-python-package from 2.11.0 to 2.12.0 by @dependabot in #217
  • build(deps): bump rhysd/actionlint from 1.7.4 to 1.7.7 in /.github/workflows by @dependabot in #216
  • build(deps): bump step-security/harden-runner from 2.10.3 to 2.10.4 by @dependabot in #215
  • build(deps): bump DavidAnson/markdownlint-cli2-action from 18.0.0 to 19.1.0 by @dependabot in #213
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.45.0 to 0.46.0 by @dependabot in #207
  • ci: Don't require secrets in medium e2e test by @danmcp in #226
  • build(deps): bump actions/setup-python from 5.3.0 to 5.4.0 by @dependabot in #225
  • build(deps): bump machulav/ec2-github-runner from 2.3.7 to 2.3.8 by @dependabot in #224
  • build(deps): bump aws-actions/configure-aws-credentials from 4.0.2 to 4.0.3 by @dependabot in #223
  • build(deps): bump pypa/gh-action-pypi-publish from 1.12.3 to 1.12.4 by @dependabot in #222
  • build(deps): bump aws-actions/configure-aws-credentials from 4.0.3 to 4.1.0 by @dependabot in #228
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.46.0 to 0.47.0 by @dependabot in #229
  • build(deps): bump step-security/harden-runner from 2.10.4 to 2.11.0 by @dependabot in #230
  • build(deps): bump actions/cache from 4.2.0 to 4.2.1 by @dependabot in #231
  • build(deps): bump actions/cache from 4.2.1 to 4.2.2 by @dependabot in #233
  • build(deps): bump actions/download-artifact from 4.1.8 to 4.1.9 by @dependabot in #232
  • build(deps): bump actions/setup-python from 5.4.0 to 5.5.0 by @dependabot in #239
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.47.0 to 0.48.0 by @dependabot in #240
  • build(deps): bump step-security/harden-runner from 2.11.0 to 2.11.1 by @dependabot in #241
  • build(deps): bump actions/download-artifact from 4.1.9 to 4.2.1 by @dependabot in #237
  • build(deps): bump actions/cache from 4.2.2 to 4.2.3 by @dependabot in #236
  • Implement leaderboard as a benchmark by @RobotSail in #234

Full Changelog: v0.5.1...v0.6.0

v0.5.1

21 Jan 16:24
bdece44

What's Changed

  • chore: Change default temporary write directory in all e2e CI jobs from tmpfs to /home/tmp by @courtneypacheco in #210
  • build(deps): bump step-security/harden-runner from 2.10.2 to 2.10.3 by @dependabot in #209
  • Bump ragas version by @alimaredia in #212

Full Changelog: v0.5.0...v0.5.1

v0.5.0

09 Jan 23:23
e31d19b

What's Changed

Full Changelog: v0.4.2...v0.5.0

v0.4.2

13 Dec 22:29
c086116

What's Changed

  • build(deps): bump DavidAnson/markdownlint-cli2-action from 17.0.0 to 18.0.0 by @dependabot in #180
  • Adjust to slack-github-action 2.0 api changes by @danmcp in #182
  • Don't fail fast for unit and functional tests by @danmcp in #183
  • Add make judge single test by @danmcp in #184
  • Add reorg answer file test by @danmcp in #185
  • Add disk check after tests run by @danmcp in #190
  • Move AWS_REGION from using secret to var by @danmcp in #191
  • build(deps): bump actions/cache from 4.1.2 to 4.2.0 by @dependabot in #192
  • build(deps): bump step-security/harden-runner from 2.10.1 to 2.10.2 by @dependabot in #186
  • Allows MMLU to have the system_prompt provided to it by @RobotSail in #197

Full Changelog: v0.4.1...v0.4.2

v0.4.1

14 Nov 22:19
4bde0b3

What's Changed

  • Handle no valid eval results for mt_bench by @danmcp in #179

Full Changelog: v0.4.0...v0.4.1

v0.4.0

12 Nov 23:44
8e32704

What's Changed

  • build(deps): bump rhysd/actionlint from 1.7.2 to 1.7.3 in /.github/workflows by @dependabot in #142
  • Add missing comment for error_rate return by @danmcp in #141
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.42.0 to 0.43.0 by @dependabot in #147
  • build(deps): bump actions/checkout from 4.2.0 to 4.2.1 by @dependabot in #146
  • build(deps-dev): update pre-commit requirement from <4.0,>=3.0.4 to >=3.0.4,<5.0 by @dependabot in #145
  • build(deps): bump pypa/gh-action-pypi-publish from 1.10.2 to 1.10.3 by @dependabot in #144
  • chore: rename 'basic-workflow-tests' to 'e2e-custom' by @nathan-weinberg in #152
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.43.0 to 0.43.1 by @dependabot in #154
  • Give nice error for empty taxonomy by @danmcp in #151
  • ci: change small E2E CI job to medium by @nathan-weinberg in #155
  • ci: add large-size E2E CI job by @nathan-weinberg in #157
  • ci: use org variable for AWS EC2 AMI in E2E CI jobs by @nathan-weinberg in #159
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.43.1 to 0.44.0 by @dependabot in #160
  • build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #161
  • ci: convert med E2E CI job to L4 GPU by @nathan-weinberg in #162
  • build(deps): bump actions/checkout from 4.2.1 to 4.2.2 by @dependabot in #158
  • build(deps): bump pypa/gh-action-pypi-publish from 1.10.3 to 1.11.0 by @dependabot in #164
  • feat: use custom http_client by @leseb in #163
  • build(deps): bump hynek/build-and-inspect-python-package from 2.9.0 to 2.10.0 by @dependabot in #166
  • build(deps): bump machulav/ec2-github-runner from 2.3.6 to 2.3.7 by @dependabot in #167
  • Add facilities for unit and functional tests by @danmcp in #165
  • build(deps): bump rhysd/actionlint from 1.7.3 to 1.7.4 in /.github/workflows by @dependabot in #168
  • build(deps): bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.0 by @dependabot in #170
  • build(deps): bump rojopolis/spellcheck-github-actions from 0.44.0 to 0.45.0 by @dependabot in #171
  • build(deps): bump pypa/gh-action-pypi-publish from 1.12.0 to 1.12.2 by @dependabot in #175
  • Add check data unit tests by @danmcp in #169
  • Undo commit of unit cov and add to gitignore by @danmcp in #172
  • Remove functional test output and add to .gitignore by @danmcp in #173
  • Add model adapter unit tests by @danmcp in #174

Full Changelog: v0.3.1...v0.4.0

v0.3.1

01 Oct 01:45
c05af4d

What's Changed

  • Remove task logic with lm_eval 0.4.4 for agg_score by @danmcp in #143

Full Changelog: v0.3.0...v0.3.1

v0.3.0

28 Sep 01:07
40cc370

What's Changed

Note: This release contains two changes which aren't backwards compatible:

  • Remove max_workers and serving_gpus from constructor by @danmcp in #140
  • return overall_score from MTBenchBranch.judge_answers() by @alimaredia in #138

Full Changelog: v0.2.1...v0.3.0

v0.2.1

23 Sep 14:10
53d6abf

What's Changed

  • update README by @sallyom in #108
  • Use single answer file and model list (backport #110) by @mergify in #112
  • mergify: add mergify configuration by @nathan-weinberg in #114
  • Bump step-security/harden-runner from 2.8.1 to 2.9.1 by @dependabot in #94
  • ci: move E2E runner from github to AWS by @nathan-weinberg in #118
  • docs: add initial release strategy doc and CHANGELOG by @nathan-weinberg in #91
  • CI: Fix working directories to be relative by @danmcp in #120
  • Bump actions/setup-python from 5.1.1 to 5.2.0 by @dependabot in #119
  • Bump actions/checkout from 4.1.6 to 4.1.7 by @dependabot in #116
  • build(deps): bump pypa/gh-action-pypi-publish from 1.9.0 to 1.10.0 by @dependabot in #122
  • ci: add AWS tags to show github ref and PR num for all jobs by @nathan-weinberg in #123
  • Bump rojopolis/spellcheck-github-actions from 0.38.0 to 0.41.0 by @dependabot in #96
  • build(deps): bump pypa/gh-action-pypi-publish from 1.10.0 to 1.10.1 by @dependabot in #124
  • build(deps): bump hynek/build-and-inspect-python-package from 2.6.0 to 2.9.0 by @dependabot in #125
  • build(deps): bump DavidAnson/markdownlint-cli2-action from 16.0.0 to 17.0.0 by @dependabot in #126
  • build(deps): bump step-security/harden-runner from 2.9.1 to 2.10.1 by @dependabot in #127
  • Add comment to make it clear how the code is working by @danmcp in #105
  • Allow for external serving to be used with mmlu by @danmcp in #99
  • Better path and string handling by @danmcp in #106
  • Improve logging by @danmcp in #111
  • Cleanup usage of load model answers by @danmcp in #115
  • add option to pass 'api_key' to gen_answers, judge_answers by @sallyom in #128
  • e2e: only run PR job if certain files are changed by @nathan-weinberg in #131
  • Allow max_workers to be passed in after evaluator is created by @danmcp in #107
  • Remove fastchat dependency by @danmcp in #98

Full Changelog: v0.2.0...v0.2.1

v0.1.2

27 Aug 23:30
ff54038

What's Changed

  • Use single answer file and model list by @danmcp in #110

Full Changelog: v0.1.1...v0.1.2