
[v2]: Starting to clean up tests#3306

Merged
KennethEnevoldsen merged 38 commits into v2.0.0 from faster-tests
Oct 12, 2025

Conversation

@KennethEnevoldsen (Contributor) commented Oct 10, 2025

Plan:

  • Started to remove tests that cover already-tested functionality (minimally, and only when I am 100% sure)
  • If a test uses MTEB, I generally refactor it to use evaluate
  • Generally move things around to reflect the structure of the package
  • If a test file contains semantically different tests, split it up into multiple files
  • For all deprecated functions, I moved the tests into test_deprecated/*
  • Speed:
    • Removed intfloat/multilingual-e5-small from the integration test to reduce the number of model downloads required
    • Replaced "InstructIR" with "IFIRNFCorpus" in task_grid.py, as it is notably smaller
    • Reduced the get_tasks test grid from 2592 to 16+18+24+54=112 cases by splitting it up into multiple smaller parametrizations
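The grid reduction above can be sketched with plain itertools (the dimensions and numbers below are illustrative, not the actual mteb test parameters): testing every combination of filters in one parametrized test multiplies the sizes, while testing each filter group in its own parametrization only adds them.

```python
from itertools import product

# Illustrative filter dimensions (hypothetical; not the real mteb parameters)
languages = range(6)
domains = range(6)
task_types = range(6)
scripts = range(12)

# One combined grid: the sizes multiply
combined = list(product(languages, domains, task_types, scripts))
print(len(combined))  # 2592 test cases

# Each filter group in its own parametrization: the sizes add
split_total = len(languages) + len(domains) + len(task_types) + len(scripts)
print(split_total)  # 30 test cases
```

The same multiplicative-vs-additive effect is why splitting the real grid into several instances brought 2592 cases down to 112 while still exercising every filter.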

I would love to get some feedback on whether people agree with these changes; then I will continue.

One of the major refactors was test_benchmark, which contained integration tests, tests of MTEB, and tests of get_benchmark and the benchmarks themselves (e.g. that names must be unique).

TODO:

  •  fix remaining issues in getting the tests to run
  •  figure out where the file saves come from (will fix in another PR)

@Samoed (Member) commented Oct 10, 2025

I've started looking into when our test time increased.

  1. After updating the retrieval format: +10 min ([v2] Change corpus and queries to use dataset #2885; 52 min: https://github.com/embeddings-benchmark/mteb/actions/runs/16714678318/usage, vs. 42 min on the previous PR: https://github.com/embeddings-benchmark/mteb/actions/runs/16706198422/usage). Maybe we need to change TwitterHjerneRetrieval as the test for the retrieval dataloader and use NanoBeir tasks instead.
  2. After the merge of main: another +10 min ([v2] Merge main 30 08 #3102; this PR has 1h tests: https://github.com/embeddings-benchmark/mteb/actions/runs/17370564430/usage, vs. 50 min on the previous PR: https://github.com/embeddings-benchmark/mteb/actions/runs/17233888466/usage?pr=3040). Maybe the evaluators run takes too long.

I will try to review the tests that we want to remove.

@KennethEnevoldsen (Contributor, Author) commented Oct 10, 2025

@Samoed

This test seems to be especially problematic: test_benchmark_integration_with_datasets.py

Task 11 in test_benchmark_datasets seems to be Core17InstructionRetrieval or InstructIR:

1216.05s call     tests/test_benchmark/test_benchmark_integration_with_datasets.py::test_benchmark_datasets[model0-task11]
47.03s call     tests/test_cli.py::test_create_meta
38.62s call     tests/test_overview.py::test_get_tasks_size_differences
20.43s call     tests/test_benchmark/test_benchmark.py::test_multiple_mteb_tasks[model0-tasks0]
19.23s call     tests/test_cli.py::test_run_task[intfloat/multilingual-e5-small-BornholmBitextMining-fd1525a9fd15316a2d503bf26ab031a61d056e98]

@Samoed (Member) commented Oct 10, 2025

This should be InstructIR; it has 9906 queries, maybe that is the reason, because Core17InstructionRetrieval has a similar corpus but runs fast, with only 40 queries.

@Samoed (Member) commented Oct 10, 2025

We could also split the test runs. E.g. we could move the metadata-related tests into a separate run to increase feedback speed. Right now we have tests that check citation formatting, which is a bit strange, because when they fail we need to work out what's wrong with them.

Maybe it would be better to create a GitHub Action that tests this automatically (because GitHub doesn't support multiple PR templates and users don't fill them in), but this can be done later and of course in a separate issue.

@KennethEnevoldsen (Contributor, Author)

because GitHub doesn't support multiple PR templates and users don't fill them in

Should we just delete these? (separate issue)

@KennethEnevoldsen (Contributor, Author)

This should be InstructIR, it has 9906 queries, maybe this is the reason

Replaced it with "IFIRNFCorpus", which has ~3000 samples and ~200 queries.

@Samoed (Member) commented Oct 11, 2025

because GitHub doesn't support multiple PR templates and users don't fill them in

Should we just delete these? (separate issue)

I was talking about our checklists. I'm not sure if we want to delete them, but we need to automate them.

Contributor Author

@Samoed, these tasks now fail. I assume it is because the MockNumpyEncoder is random.

Should I fix this to make it more consistent?

Member

I don't see this test failing in the logs

Contributor Author

Hmm, it fails locally though...

Member

Maybe something is different between numpy versions

Contributor Author

I will try to make the encoder a bit more consistent

Member

I've rounded predictions in #3322

Contributor Author

I would love to remove these tests. I am not entirely sure what they test (that every task has a load_data?), as everything that we would want to test is mocked.

Member

I agree

assert "18670" in results


def test_reranker_same_ndcg1(tmp_path: Path):
Contributor Author

@Samoed can you help me figure out this test?

Member

This test checks that scores from the encoder can be used for the cross-encoder. But the part that compares it to the same NDCG as the one-stage setup is a bit strange, and I also wanted to remove it (or change it to use mock models). For that, we first need to create a mock cross-encoder. We also need to add a few more tests for cross-encoders.
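The two-stage setup described here can be sketched roughly as follows (a toy illustration with made-up scoring functions, not mteb's actual API or models): a first-stage "encoder" scores the whole corpus, and a "cross-encoder" rescores only the top-k shortlist.

```python
# Toy two-stage retrieval sketch (hypothetical scoring; not mteb's API)
def first_stage_scores(query: str, docs: list[str]) -> dict[str, int]:
    """Toy "encoder": score each doc by word overlap with the query."""
    q = set(query.split())
    return {d: len(q & set(d.split())) for d in docs}

def rerank(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Shortlist top-k by first-stage score, then rescore the shortlist."""
    scores = first_stage_scores(query, docs)
    shortlist = sorted(docs, key=lambda d: scores[d], reverse=True)[:k]
    # Toy "cross-encoder": prefer exact phrase matches, break ties
    # with the first-stage score; only the shortlist is rescored.
    return sorted(shortlist, key=lambda d: (query in d, scores[d]), reverse=True)

docs = ["fast unit tests", "slow integration tests", "fast tests matter"]
print(rerank("fast tests", docs))
```

A mock cross-encoder for the test suite would play the role of the second function, so the test can verify that first-stage scores are passed through correctly without comparing NDCG against the one-stage setup.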

@Samoed (Member) commented Oct 12, 2025

We can limit max_iter for the classification task to speed up tests:

classifier: SklearnClassifierProtocol = LogisticRegression(
n_jobs=-1,
max_iter=100,
)

@KennethEnevoldsen (Contributor, Author)

@isaac-chung and @Samoed, one of the slowest tests I found was the test for get_tasks, which had a grid of 2592 and had to loop over a lot of tasks (sometimes all, depending on the filter). While not individually very slow, these tests were a major part of the overall runtime.

I split it up into multiple segments, which reduced the grid to 16+18+24+54=112 cases, still a quite extensive grid.

@isaac-chung (Collaborator)

It would be great to have this merged sooner rather than later (I suggest leaving the file-saving stuff separate) to fully leverage the sped-up tests. Having tests run for 1hr+ on each PR is not ideal 🙏

@KennethEnevoldsen (Contributor, Author)

Alright, I figured out the mistake here. It took some time, as the error only happens when running multiple tests at once.

Since mteb.evaluate unloads the dataset after it has been run (added by @Samoed, and a very reasonable default for memory), task.dataset gets set to None, which leads future tests to fail. This only happens in Mock*Tasks, though, as they set self.dataset["train"] using a function to generate the data. The fix is simple: recreate self.dataset in the load_data function.
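The fix can be sketched like this (a simplified stand-in, not the actual Mock*Task code): load_data regenerates the dataset on every call instead of only building it once, so the task survives being unloaded after an evaluation run.

```python
# Simplified stand-in for a Mock*Task (hypothetical; not the real mteb class)
class MockTask:
    def __init__(self) -> None:
        self.dataset = None

    def _generate(self) -> dict:
        # Function-generated mock data, as in the Mock*Tasks described above
        return {"train": [{"text": f"document {i}"} for i in range(4)]}

    def load_data(self) -> None:
        # Recreate the dataset on every call, not only in __init__;
        # otherwise a second evaluation after unload() would see None.
        self.dataset = self._generate()

    def unload(self) -> None:
        # What evaluate() effectively does after a run, to free memory
        self.dataset = None

task = MockTask()
task.load_data()
task.unload()      # simulates an evaluation run finishing
task.load_data()   # second run: the dataset is regenerated, not None
print(task.dataset["train"][0])
```

The key point is that load_data is idempotent and self-sufficient, so the order in which tests run (and whether an earlier test unloaded the task) no longer matters.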

@KennethEnevoldsen (Contributor, Author)

I think these are the remaining tests that need to be solved, but it is down to 3 min:

[Screenshot: remaining failing tests, 2025-10-12 17:56]

In comparison, we started at >30 min locally (all of it with a hot cache of models and datasets).

I will take a shower and make some dinner, though, so I will leave it here.

It seems to only be a local error, though (I suspect it's something to do with the numpy version, the seed, or something like that). Anyway, that shouldn't happen, so I will fix it. However, assuming this runs fine, do feel free to merge (and then I will fix it in another PR).

@KennethEnevoldsen (Contributor, Author)

These are the slowest tests with a hot cache; without it, it is almost always the integration tests with datasets:

[Screenshot: slowest test durations, 2025-10-12 18:02]

There are some obvious fixes (numbers refer to the items in the screenshot):

  • Change out Colbert for a smaller model, e.g. answerdotai/answerai-colbert-small-v1
  • (1, 5) Reduce the number of iterations of the linear classifier used in MockClassificationTask (I actually just added this)
  • (2, 4) are both CLI tests, which I imagine take some time to start up (mostly from loading mteb)

@KennethEnevoldsen (Contributor, Author)

I will just enable auto-merge here; feel free to do a post-review. I will do a few follow-up PRs, but having the tests run 10x faster should speed up development a bit.

logging.basicConfig(level=logging.INFO)


# NOTE: Covers image and image-text tasks. Can be extended to cover new mixed-modality task types.
Member

Should we add issues for this?

languages=["eng-Latn"],
revision="1",
release_date=None,
modalities=["text"],
Member

Most of the time, mocks should support both modalities

Suggested change
modalities=["text"],
modalities=["text", "image"],

Contributor Author

But the data is text, right?

Member

Yes, I just think it would be easier if mock models supported all modalities by default, because if they don't support something, then tasks should be skipped.

logging.basicConfig(level=logging.INFO)


@pytest.mark.parametrize("task", MOCK_MIEB_TASK_GRID)
Member

Should we combine with mteb tasks test?

Contributor Author

We can't, given the model, right? But we might be able to move it to a better spot.

logging.basicConfig(level=logging.INFO)


@pytest.mark.parametrize("task", TASK_TEST_GRID)
Member

We need to add tests for mieb tasks too, but they're huge most of the time

Contributor Author

Do we need to do that if we have the mock tasks?

Member

We need to have integration tests with the real dataset formats to verify that we can download them correctly

@KennethEnevoldsen KennethEnevoldsen merged commit f3d1d4b into v2.0.0 Oct 12, 2025
10 checks passed
@KennethEnevoldsen KennethEnevoldsen deleted the faster-tests branch October 12, 2025 16:46
@isaac-chung (Collaborator)

I will just enable auto-merge here; feel free to do a post-review. I will do a few follow-up PRs, but having the tests run 10x faster should speed up development a bit.

Thanks so much!

@Samoed (Member) commented Oct 12, 2025

Great work!
