[v2]: Starting to clean up tests #3306
Conversation
Plan:
- Started removing tests that test existing functionality
- If a test uses MTEB, I generally refactor it to use `evaluate`
- Generally move things around to reflect the structure of the package
- If a test file contains semantically different tests, split it up into multiple files
- Removed `intfloat/multilingual-e5-small` from the integration test to reduce the number of model downloads required
I've started looking into when our test runtime increased.
I will try to review the tests that we want to remove.
This test seems to be especially problematic: task 11 in
This should be
We can also split the test runs. E.g. we could move the metadata-related tests into a separate job to increase feedback speed. Right now we have tests that format citations, which is a bit strange, because we need to check what's wrong with them. Maybe it would be better to create a GitHub Action that tests this automatically (since GitHub doesn't support multiple PR templates and users don't fill them in), but this can be done later, and of course in a separate issue.
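Moving the metadata/citation checks into a separate, fast job could be sketched with a custom pytest marker. This is a hypothetical illustration, not the existing mteb test setup: the marker name and test body are assumptions.

```python
# Hypothetical sketch: tag metadata/citation tests with a custom marker so
# they can run as a separate, fast CI job:
#   pytest -m metadata          # only the metadata checks
#   pytest -m "not metadata"    # everything else
import pytest


@pytest.mark.metadata
def test_citation_is_valid_bibtex():
    citation = "@article{example, title={Example}, year={2024}}"
    # Cheap structural check; a real test could use a BibTeX parser instead.
    assert citation.startswith("@")
    assert citation.count("{") == citation.count("}")
```

The marker would also need to be registered under `[tool.pytest.ini_options] markers` to avoid unknown-marker warnings.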
Should we just delete these? (separate issue)
Replaced it with "IFIRNFCorpus", which has ~3000 samples and ~200 queries.
…oved to models.py
I was talking about our checklists; I'm not sure if we want to delete them, but we need to automate them.
@Samoed, these tasks now fail. I assume it is because the MockNumpyEncoder is random.
Should I fix this to make it more consistent?
I don't see this test failing in the logs.
Hmm ahh, it fails locally though...
Maybe something differs between numpy versions.
I will try to make the encoder a bit more consistent
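One way to make such a mock encoder consistent across runs (and across numpy versions) is to derive the RNG seed from the input text itself instead of relying on global random state. A minimal sketch, with a hypothetical class name — not the actual `MockNumpyEncoder`:

```python
import hashlib

import numpy as np


class DeterministicMockEncoder:
    """Mock encoder whose embeddings are a pure function of the input text,
    so repeated runs produce identical vectors."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def encode(self, sentences: list, **kwargs) -> np.ndarray:
        out = np.empty((len(sentences), self.dim), dtype=np.float64)
        for i, text in enumerate(sentences):
            # Seed from a hash of the text, not from global RNG state.
            seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
            out[i] = np.random.default_rng(seed).random(self.dim)
        return out
```

Because each row depends only on its text, test scores stay stable no matter how many other tests ran first.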
tests/test_tasks/test_load_data.py
I would love to remove these tests. I'm not entirely sure what they test (that every task has a `load_data`?), as everything we would want to test is mocked.
assert "18670" in results

def test_reranker_same_ndcg1(tmp_path: Path):
@Samoed can you help me figure out this test?
This test checks that scores from the encoder can be used for the cross-encoder. But the part that compares it to the same NDCG as the one-stage setup is a bit strange, and I also wanted to remove it (or change it to use mock models). For that, we first need to create a mock cross-encoder. We also need to add a few more tests for cross-encoders.
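A mock cross-encoder for such tests might look like the sketch below. The `predict(pairs)` interface mirrors the common cross-encoder convention; the scoring heuristic (token overlap) is purely illustrative and is an assumption, not anything in mteb.

```python
import numpy as np


class MockCrossEncoder:
    """Hypothetical mock cross-encoder: scores (query, passage) pairs
    deterministically so reranking tests need no real model."""

    def predict(self, sentence_pairs: list, **kwargs) -> np.ndarray:
        scores = []
        for query, passage in sentence_pairs:
            # Jaccard overlap of lowercased tokens: cheap and deterministic.
            q, p = set(query.lower().split()), set(passage.lower().split())
            scores.append(len(q & p) / (len(q | p) or 1))
        return np.array(scores)
```

A passage identical to the query scores 1.0 and a disjoint one 0.0, which is enough to assert that reranking reorders candidates sensibly.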
We can limit it (see mteb/mteb/abstasks/AbsTaskAnyClassification.py, lines 113 to 116 in f1c6cf0).
@isaac-chung and @Samoed, one of the slowest tests I found was the test for `get_tasks`, which had a parameter grid of 2592 and had to loop over a lot of tasks (sometimes all, depending on the filter). While not individually very slow, these cases made up a major part of the overall runtime. I split the test into multiple instances, which reduced the grid to 16+18+24+54=112 — still quite extensive.
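The kind of split described above can be illustrated as follows. The filter axes and values here are made up; the point is replacing one stacked cross-product parametrization with several per-axis ones.

```python
import pytest

LANGUAGES = ["eng", "fra", "deu"]
DOMAINS = ["web", "news", "legal"]
TASK_TYPES = ["Classification", "Retrieval", "STS"]


# Before: stacked parametrize decorators build the full cross-product,
# 3 * 3 * 3 = 27 cases, each invoking the (slow) filtering logic.
@pytest.mark.parametrize("language", LANGUAGES)
@pytest.mark.parametrize("domain", DOMAINS)
@pytest.mark.parametrize("task_type", TASK_TYPES)
def test_get_tasks_cross_product(language, domain, task_type):
    ...


# After: one test per filter axis, 3 + 3 + 3 = 9 cases, while every
# filter value is still exercised at least once.
@pytest.mark.parametrize("language", LANGUAGES)
def test_get_tasks_by_language(language):
    ...


@pytest.mark.parametrize("domain", DOMAINS)
def test_get_tasks_by_domain(domain):
    ...
```

Going from multiplication to addition of the axis sizes is exactly what turns 2592 cases into 16+18+24+54=112.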
It would be great to have this merged sooner rather than later (I suggest leaving the file-saving stuff for a separate PR) so we can fully leverage the sped-up tests. Having tests run 1hr+ on each PR is far from ideal 🙏
Alright, I figured out the mistake here. It took some time, as the error only happens when running multiple tests at once. Since
I will just enable auto-merge here; feel free to do a post-review. I will do a few follow-up PRs, but having the tests run 10x faster should speed up development a bit.
logging.basicConfig(level=logging.INFO)

# NOTE: Covers image and image-text tasks. Can be extended to cover new mixed-modality task types.

languages=["eng-Latn"],
revision="1",
release_date=None,
modalities=["text"],
Most of the time, mocks should support both modalities.
Suggested change:
- modalities=["text"],
+ modalities=["text", "image"],
But the data is text, right?
Yes, I just think it would be easier if mock models supported all modalities by default, because if they don't support something, then tasks have to be skipped.
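A minimal sketch of what "support all modalities by default" could mean for a mock model. The class and method names are assumptions for illustration, not the actual mteb mock interface:

```python
import numpy as np


class MockMultiModalEncoder:
    """Hypothetical mock that advertises both modalities, so the mock itself
    is never the reason a task gets skipped; text-only tasks simply never
    call the image method."""

    modalities = ("text", "image")

    def get_text_embeddings(self, texts: list, **kwargs) -> np.ndarray:
        # Constant embeddings are enough for tests that only check plumbing.
        return np.ones((len(texts), 8), dtype=np.float32)

    def get_image_embeddings(self, images: list, **kwargs) -> np.ndarray:
        return np.ones((len(images), 8), dtype=np.float32)
```

With a single mock like this, task grids don't need per-modality mock variants.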
logging.basicConfig(level=logging.INFO)

@pytest.mark.parametrize("task", MOCK_MIEB_TASK_GRID)
Should we combine this with the mteb tasks test?
We can't, given the model, right? But we might be able to move it to a better spot.
logging.basicConfig(level=logging.INFO)

@pytest.mark.parametrize("task", TASK_TEST_GRID)
We need to add tests for MIEB tasks too, but they're huge most of the time.
Do we need to do that if we have the mock tasks?
We need integration tests with the real dataset formats to verify that we can download them correctly.
Thanks so much!
Great work!


Plan:
- `test_deprecated/*`
- removed `intfloat/multilingual-e5-small` from the integration test to reduce the number of model downloads required
- `task_grid.py`, as it is notably smaller
- reduced the `get_tasks` grid from 2592 to 16+18+24+54=112 by splitting it up into multiple instances

I would love to get some feedback on whether people agree with these changes, then I will continue.

One of the major refactors was `test_benchmark`, which contained both integration tests, tests of MTEB, and tests of `get_benchmark` and the benchmarks themselves (e.g. names must be unique).

TODO:
- figure out where the file saves come from

Will fix it in another PR.