BREAKING: v2.0.0 #1433

Draft · wants to merge 159 commits into main

Conversation

@KennethEnevoldsen (Contributor) commented Nov 11, 2024

This is a work-in-progress branch that will become the MTEB v2.0.0 release!

Features:

@x-tabdeveloping, @orionw, @isaac-chung, @Samoed, @gowitheflow-1998 etc., please make PRs to this branch when relevant (MIEB still goes in its own branch, but we will try to merge it in here)

@KennethEnevoldsen KennethEnevoldsen added this to the v2.0.0 milestone Nov 11, 2024
@isaac-chung isaac-chung marked this pull request as draft November 11, 2024 09:27
orionw and others added 5 commits November 13, 2024 11:30
* update

* merged retrieval; working

* update tasks; working multilingual

* everything working except instructions

* working instructions; just need cleanup

* add metadata for all but MindSmall

* faster evaluation; mindsmall can compute in reasonable time

* fix bad merge of docs

* lint

* fix test

* qa

* updated mindsmall

* lint

* fix debug

* Update mteb/abstasks/dataloaders.py

Co-authored-by: Roman Solomatin <[email protected]>

* lint

---------

Co-authored-by: Roman Solomatin <[email protected]>
Samoed and others added 20 commits November 14, 2024 21:26
* fix: Count unique texts, data leaks in calculate metrics (#1438)

* add more stat

* add more stat

* update statistics

* fix: update task metadata to allow for null (#1448)

* Update tasks table

* 1.19.5

Automatically generated by python-semantic-release

* base

* sync with main

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <[email protected]>
* enable codecarbon by default

* lint

* update flag

* add allow_multiple_runs param

* make lint

* add warning

* lint

* negate the flag

---------

Co-authored-by: Isaac Chung <[email protected]>
* run tasks

* remove test script

* lint

* remove cache

* fix SickBrSTS

* fix tests

* add datasets
* fix test

* skip mock

* add message to assert

* fix test

* lint

* fix tests

* upd tests

* update descriptive stats files

* add stat to speed
* multilingual loader

* lint
* add citations

* fix typo
* add code for computing number of qrels

* add stats fever hotpotqa msmarco topiocqa

* miracl mrtidy

* multilongdoc  miracl reranking

* add multi eurlex

* fix tests for descriptive stats

* fix tests

---------

Co-authored-by: Roman Solomatin <[email protected]>
* add code for computing number of qrels

* BibleNLPBitextMining descriptive stats added

* SwissJudgementClassification descriptive stats added

* VoyageMMarcoReranking descriptive stats added

* WebLINXCandidatesReranking descriptive stats added

* MultiEURLEXMultilabelClassification descriptive stats added

* MIRACLReranking descriptive stats added

* MindSmallReranking descriptive stats added

* updated test_TaskMetadata

* fix test

---------

Co-authored-by: Imene Kerboua <[email protected]>
Co-authored-by: Imene Kerboua <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
* fix bright loader

* lint

* fix comment
* fix: Count unique texts, data leaks in calculate metrics (#1438)

* add more stat

* add more stat

* update statistics

* fix: update task metadata to allow for null (#1448)

* Update tasks table

* 1.19.5

Automatically generated by python-semantic-release

* Fix: Made data parsing in the leaderboard figure more robust (#1450)

Bugfixes with data parsing in main figure

* Fixed task loading (#1451)

* Fixed task result loading from disk

* Fixed task result loading from disk

* fix: publish (#1452)

* 1.19.6

Automatically generated by python-semantic-release

* fix: Fix load external results with `None` mteb_version (#1453)

* fix

* lint

* 1.19.7

Automatically generated by python-semantic-release

* WIP: Polishing up leaderboard UI (#1461)

* fix: Removed column wrapping on the table, so that it remains readable

* Added disclaimer to figure

* fix: Added links to task info table, switched out license with metric

* fix: loading pre 1.11.0 (#1460)

* small fix

* fix: fix

* 1.19.8

Automatically generated by python-semantic-release

* fix: swap touche2020 to maintain compatibility (#1469)

swap touche2020 for parity

* 1.19.9

Automatically generated by python-semantic-release

* docs: Add sum per language for task counts (#1468)

* add sum per lang

* add sort by sum option

* make lint

* fix: pinned datasets to <3.0.0 (#1470)

* 1.19.10

Automatically generated by python-semantic-release

* feat: add CUREv1 retrieval dataset (#1459)

* feat: add CUREv1 dataset

---------

Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>

* feat: add missing domains to medical tasks

* feat: modify benchmark tasks

* chore: benchmark naming

---------

Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>

* Update tasks table

* 1.20.0

Automatically generated by python-semantic-release

* fix: check if `model` attr of model exists (#1499)

* check if model attr of model exists

* lint

* Fix retrieval evaluator

* 1.20.1

Automatically generated by python-semantic-release

* add cure statistics

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Napuh <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
* fix bright loader

* lint

* fix comment

* fix stats

* fix retrieval stats

* update stats

* add rest of the stat

* move batch code

* fix docs

* lint
* fix FilipinoHateSpeechClassification

* update tests
* init

* find all weird repos

* move to mteb WikipediaRetrievalMultilingual

* add base upload utils

* retrieval, classification, bitextmining

* test retrieval

* test retrieval

* test task uploaded

* update tasks

* working version

* remove comments

* lint

* move upload

* fix tests

* fix test

* move upload to task

* Update mteb/tasks/Retrieval/multilingual/WikipediaRetrievalMultilingual.py

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* fix: hatespeech filipino (#1522)

* fix FilipinoHateSpeechClassification

* update tests

* lint

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>
* fix: Count unique texts, data leaks in calculate metrics (#1438)
* add more stat
* add more stat
* update statistics
* fix: update task metadata to allow for null (#1448)
* Update tasks table
* 1.19.5
Automatically generated by python-semantic-release
* Fix: Made data parsing in the leaderboard figure more robust (#1450)
Bugfixes with data parsing in main figure
* Fixed task loading (#1451)
* Fixed task result loading from disk
* Fixed task result loading from disk
* fix: publish (#1452)
* 1.19.6
Automatically generated by python-semantic-release
* fix: Fix load external results with `None` mteb_version (#1453)
* fix
* lint
* 1.19.7
Automatically generated by python-semantic-release
* WIP: Polishing up leaderboard UI (#1461)
* fix: Removed column wrapping on the table, so that it remains readable
* Added disclaimer to figure
* fix: Added links to task info table, switched out license with metric
* fix: loading pre 1.11.0 (#1460)
* small fix
* fix: fix
* 1.19.8
Automatically generated by python-semantic-release
* fix: swap touche2020 to maintain compatibility (#1469)
swap touche2020 for parity
* 1.19.9
Automatically generated by python-semantic-release
* docs: Add sum per language for task counts (#1468)
* add sum per lang
* add sort by sum option
* make lint
* fix: pinned datasets to <3.0.0 (#1470)
* 1.19.10
Automatically generated by python-semantic-release
* feat: add CUREv1 retrieval dataset (#1459)
* feat: add CUREv1 dataset
---------
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
* feat: add missing domains to medical tasks
* feat: modify benchmark tasks
* chore: benchmark naming
---------
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
* Update tasks table
* 1.20.0
Automatically generated by python-semantic-release
* fix: check if `model` attr of model exists (#1499)
* check if model attr of model exists
* lint
* Fix retrieval evaluator
* 1.20.1
Automatically generated by python-semantic-release
* fix: Leaderboard demo data loading (#1507)
* Made get_scores error tolerant
* Added join_revisions, made get_scores failsafe
* Fetching metadata fixed for HF models
* Added failsafe metadata fetching to leaderboard code
* Added revision joining to leaderboard app
* fix
* Only show models that have metadata, when filter_models is called
* Ran linting
* 1.20.2
Automatically generated by python-semantic-release
* fix: leaderboard only shows models that have ModelMeta (#1508)
Filtering for models that have metadata
* 1.20.3
Automatically generated by python-semantic-release
* fix: align readme with current mteb (#1493)
* align readme with current mteb
* align with mieb branch
* fix test
* 1.20.4
Automatically generated by python-semantic-release
* docs: Add lang family mapping and map to task table (#1486)
* add lang family mapping and map to task table
* make lint
* add back some unclassified lang codes
* Update tasks table
* fix: Ensure that models match the names on embedding-benchmarks/results (#1519)
* 1.20.5
Automatically generated by python-semantic-release
* fix: Adding missing metadata on models and matching names up with the results repo (#1528)
* Added Voyage 3 models
* Added correct metadata to Cohere models and matched names with the results repo
* 1.20.6
Automatically generated by python-semantic-release
* feat: Evaluate missing splits (#1525)
* fix: evaluate missing splits (#1268)
* implement partial evaluation for missing splits
* lint
* requested changes done from scratch
* test for missing split evaluation added
* uncomment test
* lint
* avoid circular import
* use TaskResult
* skip tests for now
---------
Co-authored-by: Isaac Chung <[email protected]>
* got test_all_splits_evaluated passing
* tests passing
* address review comments
* make lint
* handle None cases for kg_co2_emissions
* use new results info
---------
Co-authored-by: Thivyanth <[email protected]>
* 1.21.0
Automatically generated by python-semantic-release
* fix: Correct typos superseeded -> superseded (#1532)
fix typo -> superseded
* 1.21.1
Automatically generated by python-semantic-release
* fix: Task load data error for SICK-BR-STS and XStance (#1534)
* fix task load data for two tasks
* correct dataset keys
* 1.21.2
Automatically generated by python-semantic-release
* fix: Proprietary models now get correctly shown in leaderboard (#1530)
* Fixed showing proprietary models in leaderboard
* Added links to all OpenAI models
* Fixed table formatting issues
* Bumped Gradio version
* 1.21.3
Automatically generated by python-semantic-release
* docs: Add Model Meta parameters and metadata (#1536)
* add multi_qa_MiniLM_L6_cos_v1 model meta
* add all_mpnet_base_v2
* add parameters to model meta
* make lint
* add extra params to meta
* fix: add more model meta (jina, e5) (#1537)
* add e5 model meta
* address review comments
* 1.21.4
Automatically generated by python-semantic-release
* Add cohere models (#1538)
* fix: bug cohere names
* format
* fix: add nomic models (#1543)
#1515
* fix: Added all-minilm-l12-v2 (#1542)
#1515
* fix: Added arctic models (#1541)
#1515
* fix: add sentence trimming to OpenAIWrapper (#1526)
* fix: add sentence trimming to OpenAIWrapper
* fix: import tiktoken library inside encode function
* fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name
* fix: pass tokenizer_name, max_tokens to loader
* fix: make tokenizer_name None for default
* fix: delete changes for ModelMeta
* fix: fix revision to 2 for OpenAI models
* fix: add docstring for OpenAIWrapper
* fix: lint
* feat: add openai optional dependency set
* fix: add sleep for too many requests
* fix: add lint
* fix: delete evaluate file
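For illustration, the sentence trimming described in this commit can be sketched as follows. This is an outline of the approach only, not the PR's exact code; the default `max_tokens` and the encoding name are assumptions:

```py
import tiktoken

def truncate_to_token_limit(text: str, max_tokens: int = 8191,
                            encoding_name: str = "cl100k_base") -> str:
    """Trim text to the embedding model's token limit before the API call."""
    enc = tiktoken.get_encoding(encoding_name)  # encoding name is an assumption
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
```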
* 1.21.5
Automatically generated by python-semantic-release
* fix: Fixed metadata errors (#1547)
* 1.21.6
Automatically generated by python-semantic-release
* fix: remove curev1 from multilingual (#1552)
Seems like it was added here:
1cc6c9e
* 1.21.7
Automatically generated by python-semantic-release
* fix: Add Model2vec (#1546)
* Added Model2Vec wrapper
* Added Model2vec models
* Added model2vec models to registry
* Added model2vec as a dependency
* Ran linting
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* Update mteb/models/model2vec_models.py
Co-authored-by: Kenneth Enevoldsen <[email protected]>
* Added adapted_from and superseeded_by to model2vec models.
* Added missing import
* Moved pyproject.toml to optional dependencies
* Fixed typos
* Added import error and changed model to model_name
* Added Numpy to frameworks
* Added Numpy to frameworks
* Corrected false info on model2vec models
* Replaced np.inf with maxint
* Update mteb/models/model2vec_models.py
Co-authored-by: Isaac Chung <[email protected]>
* Added option to have infinite max tokens, added it to Model2vec
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
* Made result loading more permissive, changed eval splits for HotPotQA and DBPedia (#1554)
* Removed train and dev from eval splits on HotpotQA
* Removed dev from eval splits on DBPedia
* Made task_results validation more permissive
* Readded exception in get_score
* Ran linting
* 1.21.8
Automatically generated by python-semantic-release
* docs: Correction of SICK-R metadata (#1558)
* Correction of SICK-R metadata
* Correction of SICK-R metadata
---------
Co-authored-by: rposwiata <[email protected]>
* feat(google_models): fix issues and add support for `text-embedding-005` and `text-multilingual-embedding-002` (#1562)
* fix: google_models batching and prompt
* feat: add text-embedding-005 and text-multilingual-embedding-002
* chore: `make lint` errors
* fix: address PR comments
* 1.22.0
Automatically generated by python-semantic-release
* fix(bm25s): search implementation (#1566)
fix: bm25s implementation
* 1.22.1
Automatically generated by python-semantic-release
* docs: Fix dependency library name for bm25s (#1568)
* fix: bm25s implementation
* correct library name
---------
Co-authored-by: Daniel Buades Marcos <[email protected]>
* fix: Add training dataset to model meta (#1561)
* fix: Add training dataset to model meta
Addresses #1556
* Added docs
* format
* feat: (cohere_models) cohere_task_type issue, batch requests and tqdm for visualization (#1564)
* feat: batch requests to cohere models
* fix: use correct task_type
* feat: use tqdm with openai
* fix: explicitly set `show_progress_bar` to False
* fix(publichealth-qa):  ignore rows with `None` values in `question` or `answer` (#1565)
* 1.23.0
Automatically generated by python-semantic-release
* fix wongnai
* update inits
* fix tests
* lint
* update imports
* fix tests
* lint
---------
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Napuh <[email protected]>
Co-authored-by: Daniel Buades Marcos <[email protected]>
Co-authored-by: nadshe <[email protected]>
Co-authored-by: olivierr42 <[email protected]>
Co-authored-by: Thivyanth <[email protected]>
Co-authored-by: Youngjoon Jang <[email protected]>
Co-authored-by: Rafał Poświata <[email protected]>
# Conflicts:
#	docs/tasks.md
#	mteb/abstasks/AbsTaskClassification.py
#	mteb/abstasks/AbsTaskClusteringFast.py
#	mteb/abstasks/AbsTaskInstructionRetrieval.py
#	mteb/abstasks/AbsTaskMultilabelClassification.py
#	mteb/abstasks/AbsTaskPairClassification.py
#	mteb/abstasks/AbsTaskReranking.py
#	mteb/abstasks/AbsTaskRetrieval.py
#	mteb/abstasks/AbsTaskSTS.py
#	mteb/descriptive_stats/InstructionRetrieval/Core17InstructionRetrieval.json
#	mteb/descriptive_stats/MultilabelClassification/MultiEURLEXMultilabelClassification.json
#	mteb/descriptive_stats/Reranking/AskUbuntuDupQuestions.json
#	mteb/descriptive_stats/Reranking/ESCIReranking.json
#	mteb/descriptive_stats/Reranking/WikipediaRerankingMultilingual.json
#	mteb/descriptive_stats/Retrieval/AppsRetrieval.json
#	mteb/descriptive_stats/Retrieval/BelebeleRetrieval.json
#	mteb/descriptive_stats/Retrieval/COIRCodeSearchNetRetrieval.json
#	mteb/descriptive_stats/Retrieval/CodeEditSearchRetrieval.json
#	mteb/descriptive_stats/Retrieval/CodeFeedbackMT.json
#	mteb/descriptive_stats/Retrieval/CodeFeedbackST.json
#	mteb/descriptive_stats/Retrieval/CodeSearchNetCCRetrieval.json
#	mteb/descriptive_stats/Retrieval/CodeSearchNetRetrieval.json
#	mteb/descriptive_stats/Retrieval/CodeTransOceanContest.json
#	mteb/descriptive_stats/Retrieval/CodeTransOceanDL.json
#	mteb/descriptive_stats/Retrieval/CosQA.json
#	mteb/descriptive_stats/Retrieval/JaqketRetrieval.json
#	mteb/descriptive_stats/Retrieval/NFCorpus.json
#	mteb/descriptive_stats/Retrieval/StackOverflowQA.json
#	mteb/descriptive_stats/Retrieval/SyntheticText2SQL.json
#	mteb/descriptive_stats/Retrieval/Touche2020.json
#	mteb/descriptive_stats/Retrieval/Touche2020Retrieval.v3.json
#	mteb/descriptive_stats/Retrieval/mFollowIRCrossLingualInstructionRetrieval.json
#	mteb/descriptive_stats/Retrieval/mFollowIRInstructionRetrieval.json
#	mteb/evaluation/MTEB.py
#	mteb/evaluation/evaluators/RetrievalEvaluator.py
#	mteb/leaderboard/app.py
#	mteb/leaderboard/figures.py
#	mteb/leaderboard/table.py
#	mteb/model_meta.py
#	mteb/models/arctic_models.py
#	mteb/models/e5_models.py
#	mteb/models/nomic_models.py
#	mteb/models/overview.py
#	mteb/models/sentence_transformers_models.py
#	mteb/tasks/Reranking/zho/CMTEBReranking.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/STS/por/SickBrSTS.py
#	pyproject.toml
#	tests/test_benchmark/mock_tasks.py
* sort logos, add mkdocs outline, add index page

* Added tons of documentation

* Added some more docs to abstask

* reduced docs to only include API docs for now

* fixed import hell

* Fixed more nasty import to get docs to work

* API docs work!

* fixed link

* Apply suggestions from code review

Co-authored-by: Isaac Chung <[email protected]>

* format

---------

Co-authored-by: Isaac Chung <[email protected]>
* fix: reorder argument for mteb.get_tasks
This should make the function more intuitive to use
* typo
---------
Co-authored-by: Isaac Chung <[email protected]>
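For illustration, a minimal sketch of a `get_tasks` call after the reorder (the task names are arbitrary examples; keyword usage is unchanged):

```py
import mteb

# tasks now leads the argument list of mteb.get_tasks
tasks = mteb.get_tasks(tasks=["NFCorpus", "STS12"], languages=["eng"])
```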
* fix: Make deduplication in PairClassificationEvaluator stable
* remove prompt type
* remove prompt type missed one
---------
Co-authored-by: isaac-chung <[email protected]>
* feat: add new arctic v2.0 models (#1574)

* feat: add new arctic v2.0 models

* chore: make lint

* 1.24.0

Automatically generated by python-semantic-release

* fix: Add namaa MrTydi reranking dataset (#1573)

* Add dataset class and file requirements

* pass tests

* make lint changes

* adjust meta data and remove load_data

---------

Co-authored-by: Omar Elshehy <[email protected]>

* Update tasks table

* 1.24.1

Automatically generated by python-semantic-release

* fix: Eval langs not correctly passed to monolingual tasks (#1587)

* fix SouthAfricanLangClassification.py

* add check for langs

* lint

* 1.24.2

Automatically generated by python-semantic-release

* feat: Add ColBert (#1563)

* feat: add max_sim operator for IR tasks to support multi-vector models

* docs: add doc for Model2VecWrapper.__init__(...)

* feat: add ColBERTWrapper to models & add ColBERTv2

* fix: resolve issues

* fix: resolve issues

* Update README.md

Co-authored-by: Roman Solomatin <[email protected]>

* Update README.md

Co-authored-by: Isaac Chung <[email protected]>

* Update README.md

Co-authored-by: Isaac Chung <[email protected]>

* Update mteb/evaluation/evaluators/RetrievalEvaluator.py

Co-authored-by: Isaac Chung <[email protected]>

* Update README.md

Co-authored-by: Isaac Chung <[email protected]>

* README.md: rm subset

* doc: update example for Late Interaction

* get colbert running without errors

* fix: pass is_query to pylate

* fix: max_sim add pad_sequence

* feat: integrate Jinja templates for ColBERTv2 and add model prompt handling

* feat: add revision & prompt_name

* doc: pad_sequence

* rm TODO jina colbert v2

* doc: warning: higher resource usage for MaxSim

---------

Co-authored-by: sam021313 <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
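For reference, a minimal sketch of the MaxSim operator added in this commit, in its standard late-interaction form (single query/document pair; the actual evaluator additionally batches and pads sequences):

```py
import torch

def max_sim(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """For each query token, take its best-matching document token,
    then sum over query tokens.

    query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    """
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best document token per query token
```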

* 1.25.0

Automatically generated by python-semantic-release

* doc: colbert add score_function & doc section (#1592)

* doc: colbert add score_function & doc section

* doc: Update README.md

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* doc: Update README.md

Co-authored-by: Isaac Chung <[email protected]>

---------

Co-authored-by: sam021313 <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>

* Feat: add support for scoring function (#1594)

* add support for scoring function

* lint

* move similarity to wrapper

* remove score function

* lint

* remove from InstructionRetrievalEvaluator

* Update mteb/evaluation/evaluators/RetrievalEvaluator.py

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* remove score function from README.md

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* Add new models nvidia, gte, linq (#1436)

* Add new models nvidia, gte, linq
* add warning for gte-Qwen and nvidia models re: instruction used in docs as well
---------
Co-authored-by: isaac-chung <[email protected]>

* Leaderboard: Refined plots (#1601)

* Added embedding size guide to performance-size plot, removed shading on radar chart

* Changed plot names to something more descriptive

* Made plots failsafe

* fix: Leaderboard refinements (#1603)

* Added explanation of aggregate measures

* Added download button to result tables

* Task info gets sorted by task name

* Added custom, shareable links for each benchmark

* Moved explanation of aggregate metrics to the summary tab

* 1.25.1

Automatically generated by python-semantic-release

* Feat: Use similarity scores if available (#1602)

* Use similarity scores if available

* lint

* Add NanoBEIR Datasets (#1588)

* add NanoClimateFeverRetrieval task, still requires some debugging
* move task to correct place in init file
* add all Nano datasets and results
* format code
* Update mteb/tasks/Retrieval/eng/tempCodeRunnerFile.py
Co-authored-by: Roman Solomatin <[email protected]>
* pin revision to commit and add datasets to benchmark.py
* create new benchmark for NanoBEIR
* add revision when loading datasets
* lint
---------
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: isaac-chung <[email protected]>

* Update tasks table

* Feat: Evaluate missing languages (#1584)

* init
* fix tests
* update mock retrieval
* update tests
* use subsets instead of langs
* Apply suggestions from code review
Co-authored-by: Isaac Chung <[email protected]>
* fix tests
* add to readme
* rename subset in readme
---------
Co-authored-by: Isaac Chung <[email protected]>

* Add IBM Granite Embedding Models (#1613)

* add IBM granite embedding models
* lint formatting
* add adapted_from and superseded_by to ModelMeta

* fix: disable co2_tracker for API models (#1614)

* 1.25.2

Automatically generated by python-semantic-release

* fix: set `use_instructions` to True in models using prompts (#1616)

feat: set `use_instructions` to True in models using prompts

* 1.25.3

Automatically generated by python-semantic-release

* update RetrievalEvaluator.py

* update imports

* update imports and metadata

* fix tests

* fix tests

* fix output path for retrieval

* fix similarity function

---------

Co-authored-by: Daniel Buades Marcos <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Omar Elshehy <[email protected]>
Co-authored-by: Omar Elshehy <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Sam <[email protected]>
Co-authored-by: sam021313 <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Alexey Vatolin <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>
Co-authored-by: KGupta10 <[email protected]>
Co-authored-by: Aashka Trivedi <[email protected]>
Samoed and others added 29 commits June 10, 2025 01:06
Raise an error when model not found using get_model_meta instead of returning empty model_meta (#2776)

* 1) Raise an error when model not found using get_model_meta instead of returning empty model_meta
2) Added more helpful error messages

```py
import mteb

meta = mteb.get_model_meta("BAI/bge-m3")  # intentionally misspelled
# Before fix: returns empty model meta and raises a warning that it was not found on HF
# expected behaviour: Raise an error

# After fix: Raises the following error:
# KeyError: "Model 'BAI/bge-m3' not found in MTEB registry nor on the Huggingface Hub. Did you mean: 'BAAI/bge-m3' or BAAI/bge-small-zh?"
```

This is technically a breaking change; should I move it to v2?

* fix undefined variable

* added loader to default model meta and now return modelmeta when it is found
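For illustration, suggestions like the "Did you mean" above can be produced with `difflib`; this is a sketch of the general approach, not necessarily the PR's implementation:

```py
import difflib

def suggest_models(name: str, registry: list[str]) -> list[str]:
    """Return registry names close to a misspelled model name."""
    return difflib.get_close_matches(name, registry, n=2, cutoff=0.6)

suggest_models("BAI/bge-m3", ["BAAI/bge-m3", "BAAI/bge-small-zh", "intfloat/e5-base"])
# e.g. ['BAAI/bge-m3', 'BAAI/bge-small-zh']
```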
* Add run_task for running tasks

This is the start of the deprecation of mteb.MTEB.

The planned interface is:

```py
result: TaskResult = mteb.run_task(model, task)

results: list[TaskResult] = mteb.run_tasks(model, tasks)
```
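
For comparison, a hedged sketch of how the planned entry point would replace the current class-based flow (`mteb.MTEB(...).run(...)` is the existing interface; `run_tasks` follows the plan above):

```py
import mteb

model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["NFCorpus"])

# current, class-based interface (to be deprecated)
results = mteb.MTEB(tasks=tasks).run(model)

# planned replacement
results = mteb.run_tasks(model, tasks)
```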

* fix

* Added cache_strategy and overwrite_strategy

* format

* redo cache implementation following suggestions

* fixes circular imports

* Added tests

* format

* cleanup

* format

* added corrections based on feedback

* format

* restructure test cache

* fix cache issues

* move todo to issue

* resolve tests

* fix typo in tests

* fix criteria using an Enum

* moved todos to issue #2791

* convert overwrite_strategy from literal to str | Enum

* Added deprecation warning

* format

* Add AbstaskAggregate support to run_task

* minor refactoring Enums

* Update .gitignore

Co-authored-by: Roman Solomatin <[email protected]>

* Use a StrEnum compatible with earlier Python versions (see the sketch below)

* ensure that mteb can be imported in python 3.9

* make project uv installable

* formatted toml

* redisable codecarbon as a default installation

* Moved modalities, languages, Score, Split and converted types/* to private

* revert enum refactors

* deleted METRIC_NAME and METRIC_VALUE

* refactor LANGUAGES

* merge Split and SplitName

* refactor HFSubset

* refactored statistics

* merge UrlString and STR_URL

* delete unused LangMapping

* prevent import of TaskMetadata from mteb.abstasks

* delete now unused custom_validators.py

* delete unused caching.py

* refactored MODEL_NAME and REVISION

* Convert types to PEP8 compliant PascalCase

* refactor PR

* ensure codecarbon is present for tests

* fix grammar error

* rename types to CamelCase

* delete `normalize_embeddings.py`

* fix import issue in tests

---------

Co-authored-by: Roman Solomatin <[email protected]>
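One bullet above mentions a StrEnum compatible with earlier Python versions; a common back-compat pattern looks like the sketch below (the `OverwriteStrategy` member names are hypothetical):

```py
import sys
from enum import Enum

if sys.version_info >= (3, 11):
    from enum import StrEnum
else:
    class StrEnum(str, Enum):
        """Fallback for Python < 3.11: members behave as plain strings."""
        def __str__(self) -> str:
            return str(self.value)

class OverwriteStrategy(StrEnum):  # hypothetical member names
    NEVER = "never"
    ALWAYS = "always"

assert OverwriteStrategy.ALWAYS == "always"
```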
fix dialog task loading
* [v2] refactor `languages.py`

Refactored languages.py into a module. This should be fully backwards compatible.

Fixes #2808

* rename language_script.py > language_scripts.py

* moved language .json object to correct module

* fixes name of test file

* fixes error messages to not refer to a path, but instead to an import
add mock dialog retrieval task
If people agree then I will add a new issue for making a tutorial on task selection for benchmarks using a clustering approach.

Addresses #2809
* Merge main into v2

* fix model imports

* added missing task imports

* added missing descriptive stats
* Merge main into v2

* fix model imports

* added missing task imports

* refactor task import

This refactors imports following this pattern:

```py
# tasks/__init__
from .Retrieval import *
# tasks/retrieval/__init__
from .eng import *
# tasks/retrieval/eng/__init__
from .task1 import Task1
```
proposed by @Samoed in #2825. This should reduce the number of imports required, while not exposing any of the modules required at the task definition.

* added missing descriptive stats

* format
* Merge main into v2

* fix model imports

* added missing task imports

* refactor task import

This refactors imports following this pattern:

```py
# tasks/__init__
from .Retrieval import *
# tasks/retrieval/__init__
from .eng import *
# tasks/retrieval/eng/__init__
from .task1 import Task1
```
proposed by @Samoed in #2825. This should reduce the number of imports required, while not exposing any of the modules required at the task definition.

* added missing descriptive stats

* fix: : rename TaskMetadata.py to resolve class/module ambiguity

related to: #1124
required for: #2714

It seems that in multiple places we denote the module instead of the intended TaskMetadata class. This rename should fix that issue

relies on PR #2828

* format
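A sketch of the kind of ambiguity the rename resolves (exact import sites are assumptions):

```py
# pre-rename layout: mteb/abstasks/TaskMetadata.py defines class TaskMetadata
from mteb.abstasks import TaskMetadata                # may resolve to the *module*
from mteb.abstasks.TaskMetadata import TaskMetadata   # the intended class
# renaming the module removes the class/module ambiguity
```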
* Merge main into v2

* fix model imports

* added missing task imports

* added missing descriptive stats

* fix: Added docs for `mteb.evaluate`

- renamed `mteb.run_tasks` to `mteb.evaluate`. Reverting this is fairly easy but I think the rename makes a lot of sense
- Added docs to most places
  - some aren't changed yet as they haven't been tested (#2830)
  - I didn't change the datasheet to avoid confusion with uploaded datasets

partly fixes: #2793
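
A one-line sketch of the rename, assuming `mteb.evaluate` keeps the `run_tasks` call shape:

```py
results = mteb.evaluate(model, tasks)  # formerly mteb.run_tasks(model, tasks)
```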

* format

* fix import

* Update docs/mieb/readme.md

Co-authored-by: Isaac Chung <[email protected]>

* Update docs/usage/usage.md

Co-authored-by: Isaac Chung <[email protected]>

---------

Co-authored-by: Isaac Chung <[email protected]>
* refactor copali to use new interface wip

* use v2 interface

* receive only dataloader
* add ListConRanker model

* updated the implementation of  ListConRanker

* updated the release date of ListConRanker

* added the training datasets and changed the release date of ListConRanker

* updated the training datasets of ListConRanker

* lint

* fix import

---------

Co-authored-by: Roman Solomatin <[email protected]>
Add IFIR relevant tasks.

Signed-off-by: SighingSnow <[email protected]>
- Move all implementations into a separate folder called `model_implementations`
- moved `encoder_interface.py` and `model_meta.py` into `models`
- renamed `models/*` to `encoder_implementations/*` to make the distinction between the two folders clear
- merged `models/utils.py` into the only model that used it

We seem to have a few differing names when referring to a model (ModelMeta, get_model, etc.) and encoders (Encoder, AbsEncoder). Should we try to do something about this or just leave it as is?

There is also an inconsistency in that tasks and implementations are in separate folders, but for benchmarks this is not the case.

We could convert it to:
```
benchmarks/tasks/models
| - implementations/*
| - ... # definitions utilities etc.
```
But I am not sure it is worth it and for tasks it might be too much nesting. So I would probably leave it as is.

Note: There are a few refactors that I would like to do on top of this, but I will add those in a separate PR (since it is too hard to review here)

Fixes #2299
* introduce AbsTaskAnyClustering

* trigger CI

* remove image clustering abstask

* revert

* address review comments

* fix tests

* fix descriptive stats tests

* fix for mteb eng v1 datasets
* introduce AbsTaskAnyZeroShotClassification

* fix tests

* address review comments

* add mock text ZS task and handle text case

* fix tests

* pass all encode kwargs
* fix: refactor models modules

- refactored loading of models - now all ModelMeta are imported
- fixed a few metadata issues due to missing imports
- renamed private methods to `_{prev_name}` to indicate that they are private
- renamed `models/overview.py` > `models/get_model_meta.py`
- fixed a few typing issues in the models module

* fix typing

* fixed spelling

* minor fixes to imports for clarity

* rollback readme

* allow revision to be None

* fix extract models names
* bump ruff (#2784)

* Update issue and pr templates (#2782)

* Update issue templates

* Update bug_report.md

* test yaml template

* add templates

* update templates

* add emojis

* fix typo

* Apply suggestions from code review

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* update issue titles

* update PR template

* remove PR templates

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* model: Add GeoGPT-Research-Project/GeoEmbedding (#2773)

* add model: geogpt_models

* update geogpt_models

* use InstructSentenceTransformerWrapper

* resolve pylint warning

* format geogpt_models.py

* Update mteb/models/geogpt_models.py

Co-authored-by: Roman Solomatin <[email protected]>

* Update mteb/models/geogpt_models.py

---------

Co-authored-by: zhangzeqing <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>

* model: add fangxq/XYZ-embedding (#2741)

* add xyz model

* add xyz model

* add xyz model

* update

* update

* update

* update

* update

* update

* update

* lint

---------

Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>

* ci: fix config error for semantic release (#2800)

discussed in: #2796

* dataset: Add R2MED Benchmark (#2795)

* Add files via upload

* Add files via upload

* Update benchmarks.py

* Update __init__.py

* Add files via upload

* Update R2MEDRetrieval.py

* Update run_mteb_r2med.py

* Delete scripts/run_mteb_r2med.py

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <[email protected]>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <[email protected]>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <[email protected]>

* Update mteb/tasks/Retrieval/eng/R2MEDRetrieval.py

Co-authored-by: Roman Solomatin <[email protected]>

* Add files via upload

* Delete mteb/descriptive_stats/Retrieval/R2MEDRetrieval.json

* Add files via upload

* Add files via upload

* Add files via upload

* Update R2MEDRetrieval.py

* Add files via upload

* Add files via upload

* Add files via upload

* Add files via upload

* format citations

* Update R2MEDRetrieval.py

* Add files via upload

* Add files via upload

---------

Co-authored-by: Li Lei <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>

* Update tasks & benchmarks tables

* Update training datasets of GeoGPT-Research-Project/GeoEmbedding (#2802)

update training datasets

Co-authored-by: zhangzeqing <[email protected]>

* fix: Add adapted_from to Cmedqaretrieval (#2806)

* fix: Add adapted_from to Cmedqaretrieval

Also snuck in a fix with form=None, which is no longer valid, but was still used in a few places.

* format

* 1.38.28

Automatically generated by python-semantic-release

* fix: Adding client arg to init method of OpenAI models wrapper (#2803)

* Adding OpenAI client arg to init method (e.g., for already initialized AzureOpenAI client)

To use OpenAI embedding models via Azure, the model wrapper needs to be initialized with a different client.

* Update mteb/models/openai_models.py

Co-authored-by: Roman Solomatin <[email protected]>

* Update mteb/models/openai_models.py

* remove comment and format

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>

* model: Add annamodels/LGAI-Embedding-Preview (#2810)

Add LGAI-Embedding

- Add mteb/models/lgai_embedding_models.py

- defined model metadata

* fix: Ensure bright uses the correct revision (#2812)

fixes #2811

* 1.38.29

Automatically generated by python-semantic-release

* add description to issue template (#2817)

* add description to template

* fix typo

* model: Added 3 HIT-TMG's KaLM-embedding models (#2478)

* Added HIT-TMG_KaLM-embedding-multilingual-mini-instruct-v1 with instruct wrapper

* Added KaLM_embedding_multilingual_mini_instruct_v1_5

* Added model to overview.py

* Fix Task Count Per Language Table in tasks.md

* resolve conflicts

* remove tasks.md

* Modified get_instruction function

* Added support for prompt dict in get_instruction

* fix lang code

* Address comments

* Delete mteb/models/check_models.py

* added prompts_dict support in InstructSentenceTransformerWrapper

* corrected instruction format

* corrected prompts format

* added correct instruction format

* fix implementation

* remove `if name main`

* add comment

---------

Co-authored-by: Roman Solomatin <[email protected]>

* fix: Reuploaded previously unavailable SNL datasets (#2819)

* fix: Reuploaded previously unavailable SNL datasets

closes #2477

* removed exceptions from tests

* temp fixes

* added temporary fix

* clean up commented out code

* format

* Update tasks & benchmarks tables

* 1.38.30

Automatically generated by python-semantic-release

* docs: Fix some typos in `docs/usage/usage.md` (#2835)

* Update usage.md

* Update usage.md

* Update docs/usage/usage.md

---------

Co-authored-by: Isaac Chung <[email protected]>

* model: Add custom instructions for GigaEmbeddings (#2836)

* add custom instructions

* fixed

* lint

* fix last instruction

---------

Co-authored-by: Kolodin Egor <[email protected]>
Co-authored-by: Roman Solomatin <[email protected]>

* model: add Seed-1.6-embedding model (#2841)

* add Seed-1.6-embedding model

* Update seed_1_6_embedding_models.py

* update model meta info

* support image encoder interface

* error fix

* fix: format seed_1_6_embedding_models.py with Ruff

* fix: Update model selection for the leaderboard (#2855)

* fix: Update model selection for the leaderboard

fixes #2834

This removed the lower bound selection, but generally I don't think people should care about the models being too small.

* fix 1M --> 1B

* format

* rename model_size -> max_model_size

* 1.38.31

Automatically generated by python-semantic-release

* fix: update training dataset info of Seed-1.6-embedding model  (#2857)

update seed1.6 model training data info

* 1.38.32

Automatically generated by python-semantic-release

* add jinav4 model meta (#2858)

* add model meta

* linting

* fix: add check for code lora

* fix: apply review comments

* fix: prompt validation for tasks with `-` (#2846)

* fix prompt validation

* fix task name split correctly

* add docstring for test

* 1.38.33

Automatically generated by python-semantic-release

* model: Adding Sailesh97/Hinvec (#2842)

* Adding Hinvec Model's Meta data.

* Adding hinvec_model.py

* Update mteb/models/hinvec_models.py

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* formatted code with Black and linted with Ruff

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>

* Bump gradio to fix leaderboard sorting (#2866)

Bump gradio

* model: Adding nvidia/llama-nemoretriever-colembed models (#2861)

* nvidia_llama_nemoretriever_colembed

* correct 3b reference

* lint fix

* add training data and license for nvidia/llama_nemoretriever_colembed

* lint

---------

Co-authored-by: Isaac Chung <[email protected]>

* rename seed-1.6-embedding to seed1.6-embedding (#2870)

* fix tests to be compatible with `SentenceTransformers` `v5` (#2875)

* fix sbert `v5`

* add comment

* model: add listconranker modelmeta (#2874)

* add listconranker modelmeta

* fix bugs

* use linter

* lint

---------

Co-authored-by: Roman Solomatin <[email protected]>

* model: add kalm_models ModelMeta (new PR) (#2853)

* feat: add KaLM_Embedding_X_0605 in kalm_models

* Update kalm_models.py for lint format

---------

Co-authored-by: xinshuohu <[email protected]>

* Comment kalm model (#2877)

comment kalm model

* Add and fix some Japanese datasets: ANLP datasets, JaCWIR, JQaRA (#2872)

* Add JaCWIR and JQaRA for reranking

* Fix ANLP Journal datasets

* Add NLPJournalAbsArticleRetrieval and JaCWIRRetrieval

* tackle test cases

* Remove _evaluate_subset usage

* Separate v1 and v2

* Update info for NLP Journal datasets

* Update tasks & benchmarks tables

* model: add Hakim and TookaSBERTV2 models (#2826)

* add tooka v2s

* add mcinext models

* update mcinext.py

* Apply PR review suggestions

* Update mteb/models/mcinext_models.py

---------

Co-authored-by: mehran <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>

* dataset: Evalita dataset integration (#2859)

* Added DadoEvalCoarseClassification

* Removed unnecessary columns from DadoEvalCoarseClassification

* Added EmitClassification task

* added SardiStanceClassification task

* Added GeoLingItClassification task

* Added DisCoTexPairClassification tasks

* Added EmitClassification, DadoEvalCoarseClassification, GeoLingItClassification, SardiStanceClassification inside the inits

* changed import in DisCoTexPairClassification

* removed GeoLingItClassification dataset

* fixed citation formatting, missing metadata parameters and lint formatting

* Added XGlueWPRReranking task
- Added missing __init__.py files

* fixed metadata in XGlueWPRReranking

* Added MKQARetrieval task

* fixed type in XGlueWPRReranking

* changed MKQARetrieval from  cross-lingual to monolingual

* formatted MKQARetrieval file

* removed unused const

---------

Co-authored-by: Mattia Sangermano <[email protected]>

* Update tasks & benchmarks tables

* fix: pin datasets version (#2892)

fix datasets version

* 1.38.34

Automatically generated by python-semantic-release

* fix model implementations

* fix tasks

* add metrics

---------

Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: Hypothesis-Z <[email protected]>
Co-authored-by: zhangzeqing <[email protected]>
Co-authored-by: fangxiaoquan <[email protected]>
Co-authored-by: Li Lei <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: malteos <[email protected]>
Co-authored-by: annamodels <[email protected]>
Co-authored-by: Munot Ayush Sunil <[email protected]>
Co-authored-by: Sadra Barikbin <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Egor <[email protected]>
Co-authored-by: Kolodin Egor <[email protected]>
Co-authored-by: Quan Yuhan <[email protected]>
Co-authored-by: Quan Yuhan <[email protected]>
Co-authored-by: Mohammad Kalim Akram <[email protected]>
Co-authored-by: Sailesh Panda <[email protected]>
Co-authored-by: bschifferer <[email protected]>
Co-authored-by: tutuDoki <[email protected]>
Co-authored-by: Xinshuo Hu <[email protected]>
Co-authored-by: xinshuohu <[email protected]>
Co-authored-by: lsz05 <[email protected]>
Co-authored-by: Mehran Sarmadi <[email protected]>
Co-authored-by: mehran <[email protected]>
Co-authored-by: Kenneth Enevoldsen <[email protected]>
Co-authored-by: MattiaSangermano <[email protected]>
Co-authored-by: Mattia Sangermano <[email protected]>
# Conflicts:
#	docs/create_tasks_table.py
#	docs/usage/usage.md
#	mteb/evaluation/evaluators/RetrievalEvaluator.py
#	mteb/models/instruct_wrapper.py
#	mteb/models/model_implementations/fa_models.py
#	mteb/models/model_implementations/jina_models.py
#	mteb/models/model_implementations/ru_sentence_models.py
#	mteb/models/overview.py
#	mteb/models/wrapper.py
#	mteb/tasks/Classification/__init__.py
#	mteb/tasks/Classification/ita/DadoEvalCoarseClassification.py
#	mteb/tasks/Classification/ita/SardiStanceClassification.py
#	mteb/tasks/Clustering/nob/snl_clustering.py
#	mteb/tasks/MultiLabelClassification/__init__.py
#	mteb/tasks/MultiLabelClassification/ita/EmitClassification.py
#	mteb/tasks/PairClassification/__init__.py
#	mteb/tasks/PairClassification/ita/DisCoTexPairClassification.py
#	mteb/tasks/Reranking/__init__.py
#	mteb/tasks/Reranking/jpn/JQaRAReranking.py
#	mteb/tasks/Reranking/jpn/JaCWIRReranking.py
#	mteb/tasks/Reranking/multilingual/XGlueWPRReranking.py
#	mteb/tasks/Retrieval/__init__.py
#	mteb/tasks/Retrieval/eng/R2MEDRetrieval.py
#	mteb/tasks/Retrieval/jpn/JaCWIRRetrieval.py
#	mteb/tasks/Retrieval/jpn/NLPJournalAbsArticleRetrieval.py
#	mteb/tasks/Retrieval/jpn/NLPJournalAbsIntroRetrieval.py
#	mteb/tasks/Retrieval/jpn/NLPJournalTitleAbsRetrieval.py
#	mteb/tasks/Retrieval/jpn/NLPJournalTitleIntroRetrieval.py
#	mteb/tasks/Retrieval/multilingual/MKQARetrieval.py
#	pyproject.toml
#	tests/test_benchmark/mock_models.py
#	tests/test_benchmark/test_benchmark.py
* start adding

* standardize statistics

* remove irrelevant file

* update retrieval calculation

* update zeroshot statistics

* fix random
* fix retrieval dataset upload

* add readme repo type

* fix adapted

* add reupload flag

* fix tasks uploading

* add reupload datasets flag

* reupload reuploaded MIRACLRetrieval.py

* fix trust remote code

* prepare miracl for reuploading

* use mteb miracl

* support qrels split

* roll back miracl

* remove reupload flag
* fix: Update ResultsCache

- [x] Added tests
- [x] Added utility interfaces for examining the cache
- [x] Added load_results
- [x] Updated docs to use ResultsCache instead

We could also update the leaderboard to use ResultsCache, but I don't want to do that in this PR. When that is done I would probably deprecate `mteb.load_results` or convert it to a shorthand function for
```py
ResultsCache().load_results(**kwargs)
```
Deprecating leads to fewer breaking changes.

Minor:
- removed `results/` from .gitignore

* fixed based on copilot feedback

* fix issues in tests

* Apply suggestions from code review

Co-authored-by: Isaac Chung <[email protected]>

* fix tests

* fix issues arising from multiple version across remote and results folder

---------

Co-authored-by: Isaac Chung <[email protected]>