
[v2]: Starting to clean up tests#3306

Merged
KennethEnevoldsen merged 38 commits into v2.0.0 from faster-tests
Oct 12, 2025

Conversation

@KennethEnevoldsen (Contributor) commented Oct 10, 2025

Plan:

  • Started to remove tests that cover already-tested functionality (minimally, and only when I am 100% sure)
  • If a test uses MTEB, I generally refactor it to use evaluate
  • Generally move things around to reflect the structure of the package
  • If a test file contains semantically different tests, split it up into multiple files
  • For all deprecated functions, I moved the tests into test_deprecated/*
  • Speed:
    • Removed intfloat/multilingual-e5-small from the integration test to reduce the number of model downloads required
    • Replaced "InstructIR" with "IFIRNFCorpus" in task_grid.py, as it is notably smaller
    • Reduced the get_tasks test grid from 2592 to 16+18+24+54=112 cases by splitting it up into multiple smaller parametrizations
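The grid reduction above can be sketched with plain itertools (the dimensions and numbers below are illustrative, not the actual mteb test parameters): testing every combination of filters in one parametrized test multiplies the sizes, while testing each filter group in its own parametrization only adds them.

```python
from itertools import product

# Illustrative filter dimensions (hypothetical; not the real mteb parameters)
languages = range(6)
domains = range(6)
task_types = range(6)
scripts = range(12)

# One combined grid: the sizes multiply
combined = list(product(languages, domains, task_types, scripts))
print(len(combined))  # 2592 test cases

# Each filter group in its own parametrization: the sizes add
split_total = len(languages) + len(domains) + len(task_types) + len(scripts)
print(split_total)  # 30 test cases
```

The same multiplicative-vs-additive effect is why splitting the real grid into several instances brought 2592 cases down to 112 while still exercising every filter.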

I would love to get some feedback on whether people agree with these changes; then I will continue.

One of the major refactors was test_benchmark, which contained integration tests, tests of MTEB, and tests of get_benchmark and the benchmarks themselves (e.g. that names must be unique).

TODO:

  •  fix remaining issues in getting the tests to run
  •  figure out where the file saves come from (will fix in another PR)

@Samoed (Member) commented Oct 10, 2025

I've started looking into when our test time increased.

  1. After updating the retrieval format: +10 min ([v2] Change corpus and queries to use dataset #2885; 52 min: https://github.com/embeddings-benchmark/mteb/actions/runs/16714678318/usage, vs. 42 min on the previous PR: https://github.com/embeddings-benchmark/mteb/actions/runs/16706198422/usage). Maybe we need to change TwitterHjerneRetrieval as the test for the retrieval dataloader and use NanoBeir tasks instead.
  2. After the merge of main: another +10 min ([v2] Merge main 30 08 #3102; this PR has 1h tests: https://github.com/embeddings-benchmark/mteb/actions/runs/17370564430/usage, vs. 50 min on the previous PR: https://github.com/embeddings-benchmark/mteb/actions/runs/17233888466/usage?pr=3040). Maybe the evaluators run takes too long.

I will try to review the tests that we want to remove.

@KennethEnevoldsen (Contributor, Author) commented Oct 10, 2025

@Samoed

This test seems to be especially problematic: test_benchmark_integration_with_datasets.py

Task 11 in test_benchmark_datasets seems to be Core17InstructionRetrieval or InstructIR:

1216.05s call     tests/test_benchmark/test_benchmark_integration_with_datasets.py::test_benchmark_datasets[model0-task11]
47.03s call     tests/test_cli.py::test_create_meta
38.62s call     tests/test_overview.py::test_get_tasks_size_differences
20.43s call     tests/test_benchmark/test_benchmark.py::test_multiple_mteb_tasks[model0-tasks0]
19.23s call     tests/test_cli.py::test_run_task[intfloat/multilingual-e5-small-BornholmBitextMining-fd1525a9fd15316a2d503bf26ab031a61d056e98]

@Samoed (Member) commented Oct 10, 2025

This should be InstructIR; it has 9906 queries, maybe that is the reason, because Core17InstructionRetrieval has a similar corpus but runs fast, with only 40 queries.

@Samoed (Member) commented Oct 10, 2025

We could also split the test runs. E.g. we could move the metadata-related tests into a separate run to increase feedback speed. Right now we have tests that check citation formatting, which is a bit strange, because when they fail we need to work out what's wrong with them.

Maybe it would be better to create a GitHub Action that tests this automatically (because GitHub doesn't support multiple PR templates and users don't fill them in), but this can be done later and of course in a separate issue.

@KennethEnevoldsen (Contributor, Author)

because GitHub doesn't support multiple PR templates and users don't fill them in

Should we just delete these? (separate issue)

@KennethEnevoldsen (Contributor, Author)

This should be InstructIR, it has 9906 queries, maybe this is the reason

Replaced it with "IFIRNFCorpus", which has ~3000 samples and ~200 queries.

@Samoed (Member) commented Oct 11, 2025

because GitHub doesn't support multiple PR templates and users don't fill them in

Should we just delete these? (separate issue)

I was talking about our checklists. I'm not sure if we want to delete them, but we need to automate them.

Contributor Author

@Samoed, these tasks now fail. I assume it is because the MockNumpyEncoder is random.

Should I fix this to make it more consistent?

Member

I don't see this test failing in the logs

Contributor Author

Hmm, it fails locally though...

Member

Maybe something is different between numpy versions

Contributor Author

I will try to make the encoder a bit more consistent

Member

I've rounded predictions in #3322

Contributor Author

I would love to remove these tests. I am not entirely sure what they test (that every task has a load_data?), as everything that we would want to test is mocked.

Member

I agree

assert "18670" in results


def test_reranker_same_ndcg1(tmp_path: Path):
Contributor Author

@Samoed can you help me figure out this test?

Member

This test checks that scores from the encoder can be used for the cross-encoder. But the part that compares it to the same NDCG as the one-stage setup is a bit strange, and I also wanted to remove it (or change it to use mock models). For that, we first need to create a mock cross-encoder. We also need to add a few more tests for cross-encoders.
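The two-stage setup described here can be sketched roughly as follows (a toy illustration with made-up scoring functions, not mteb's actual API or models): a first-stage "encoder" scores the whole corpus, and a "cross-encoder" rescores only the top-k shortlist.

```python
# Toy two-stage retrieval sketch (hypothetical scoring; not mteb's API)
def first_stage_scores(query: str, docs: list[str]) -> dict[str, int]:
    """Toy "encoder": score each doc by word overlap with the query."""
    q = set(query.split())
    return {d: len(q & set(d.split())) for d in docs}

def rerank(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Shortlist top-k by first-stage score, then rescore the shortlist."""
    scores = first_stage_scores(query, docs)
    shortlist = sorted(docs, key=lambda d: scores[d], reverse=True)[:k]
    # Toy "cross-encoder": prefer exact phrase matches, break ties
    # with the first-stage score; only the shortlist is rescored.
    return sorted(shortlist, key=lambda d: (query in d, scores[d]), reverse=True)

docs = ["fast unit tests", "slow integration tests", "fast tests matter"]
print(rerank("fast tests", docs))
```

A mock cross-encoder for the test suite would play the role of the second function, so the test can verify that first-stage scores are passed through correctly without comparing NDCG against the one-stage setup.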

@Samoed (Member) commented Oct 12, 2025

We can limit max_iter for the classification task to speed up tests:

classifier: SklearnClassifierProtocol = LogisticRegression(
n_jobs=-1,
max_iter=100,
)

@KennethEnevoldsen (Contributor, Author)

@isaac-chung and @Samoed, one of the slowest tests I found was the test for get_tasks, which had a grid of 2592 and had to loop over a lot of tasks (sometimes all, depending on the filter). While not individually very slow, these tests were a major part of the overall runtime.

I split it up into multiple segments, which reduced the grid to 16+18+24+54=112 cases, still a quite extensive grid.

@isaac-chung (Collaborator)

It would be great to have this merged sooner rather than later (I suggest leaving the file-saving stuff separate) to fully leverage the sped-up tests. Having tests run for 1hr+ on each PR is not ideal 🙏

@KennethEnevoldsen (Contributor, Author)

Alright, I figured out the mistake here. It took some time, as the error only happens when running multiple tests at once.

Since mteb.evaluate unloads the dataset after it has been run (added by @Samoed, and a very reasonable default for memory), task.dataset gets set to None, which leads future tests to fail. This only happens in Mock*Tasks, though, as they set self.dataset["train"] using a function to generate the data. The fix is simple: recreate self.dataset in the load_data function.
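The fix can be sketched like this (a simplified stand-in, not the actual Mock*Task code): load_data regenerates the dataset on every call instead of only building it once, so the task survives being unloaded after an evaluation run.

```python
# Simplified stand-in for a Mock*Task (hypothetical; not the real mteb class)
class MockTask:
    def __init__(self) -> None:
        self.dataset = None

    def _generate(self) -> dict:
        # Function-generated mock data, as in the Mock*Tasks described above
        return {"train": [{"text": f"document {i}"} for i in range(4)]}

    def load_data(self) -> None:
        # Recreate the dataset on every call, not only in __init__;
        # otherwise a second evaluation after unload() would see None.
        self.dataset = self._generate()

    def unload(self) -> None:
        # What evaluate() effectively does after a run, to free memory
        self.dataset = None

task = MockTask()
task.load_data()
task.unload()      # simulates an evaluation run finishing
task.load_data()   # second run: the dataset is regenerated, not None
print(task.dataset["train"][0])
```

The key point is that load_data is idempotent and self-sufficient, so the order in which tests run (and whether an earlier test unloaded the task) no longer matters.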

@KennethEnevoldsen (Contributor, Author)

I think these are the remaining tests that need to be solved, but it is down to 3 min:

[Screenshot: remaining failing tests, 2025-10-12 17:56]

In comparison, we started at >30 min locally (all of it with a hot cache of models and datasets).

I will take a shower and make some dinner, though, so I will leave it here.

It seems to only be a local error, though (I suspect it's something to do with the numpy version, the seed, or something like that). Anyway, that shouldn't happen, so I will fix it. However, assuming this runs fine, do feel free to merge (and then I will fix it in another PR).

@KennethEnevoldsen (Contributor, Author)

These are the slowest tests with a hot cache; without it, it is almost always the integration tests with datasets:

[Screenshot: slowest test durations, 2025-10-12 18:02]

There are some obvious fixes (numbers refer to the items in the screenshot):

  • Change out Colbert for a smaller model, e.g. answerdotai/answerai-colbert-small-v1
  • (1, 5) Reduce the number of iterations of the linear classifier used in MockClassificationTask (I actually just added this)
  • (2, 4) are both CLI tests, which I imagine take some time to start up (mostly from loading mteb)

@KennethEnevoldsen (Contributor, Author)

I will just enable auto-merge here; feel free to do a post-review. I will do a few follow-up PRs, but having the tests run 10x faster should speed up development a bit.

logging.basicConfig(level=logging.INFO)


# NOTE: Covers image and image-text tasks. Can be extended to cover new mixed-modality task types.
Member

Should we add issues for this?

languages=["eng-Latn"],
revision="1",
release_date=None,
modalities=["text"],
Member

Most of the time, mocks should support both modalities

Suggested change
modalities=["text"],
modalities=["text", "image"],

Contributor Author

But the data is text, right?

Member

Yes, I just think it would be easier if mock models supported all modalities by default, because if they don't support something, then tasks should be skipped.

logging.basicConfig(level=logging.INFO)


@pytest.mark.parametrize("task", MOCK_MIEB_TASK_GRID)
Member

Should we combine with mteb tasks test?

Contributor Author

We can't, given the model, right? But we might be able to move it to a better spot.

logging.basicConfig(level=logging.INFO)


@pytest.mark.parametrize("task", TASK_TEST_GRID)
Member

We need to add tests for mieb tasks too, but they're huge most of the time

Contributor Author

Do we need to do that if we have the mock tasks?

Member

We need to have integration tests with the real dataset formats to verify that we can download them correctly

@KennethEnevoldsen KennethEnevoldsen merged commit f3d1d4b into v2.0.0 Oct 12, 2025
10 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the faster-tests branch October 12, 2025 16:46
@isaac-chung (Collaborator)

I will just enable auto-merge here; feel free to do a post-review. I will do a few follow-up PRs, but having the tests run 10x faster should speed up development a bit.

Thanks so much!

@Samoed (Member) commented Oct 12, 2025

Great work!
