Adds VocalSound dataset #2337

mina-parham · 2025-03-12T03:42:55Z

Closes #2313

Code Quality

Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command.
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- intfloat/multilingual-e5-small
I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
I have filled out the metadata object in the dataset file (find documentation on it here).
Run tests locally to make sure nothing is broken using make test.
Run the formatter to format the code using make lint.
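
For readers unfamiliar with the subsampling step mentioned in the checklist, the idea can be sketched in plain Python. This is only an illustration under stated assumptions: `stratified_subsample` is a hypothetical helper, not mteb's `self.stratified_subsampling()`, and the label names and counts are made up.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that keeps class proportions.

    Hypothetical helper for illustration; not the mteb implementation.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fraction = n_samples / len(labels)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        # Keep roughly the same share of every class, at least one example each.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(chosen)


# Made-up label distribution, only to show that proportions are preserved.
labels = ["laugh"] * 600 + ["cough"] * 300 + ["sneeze"] * 100
subset = stratified_subsample(labels, n_samples=100)
```

With these made-up counts, the subsample keeps the 6:3:1 class ratio of the full label list.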

Samoed · 2025-03-12T06:48:43Z

mteb/tasks/Audio/AudioClassification/eng/VocalSound.py

+    def dataset_transform(self):
+        self.dataset["train"] = self.dataset["test"]


I think it would be better to use cross-validation instead of the test split.

The test split should be removed, because for now training and testing would both run on the same test split.
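
To illustrate the concern, copying `test` into `train` means every evaluation example is also a training example. A minimal sketch of the cross-validation alternative, in plain Python, could look like the following. `k_fold_indices` is a hypothetical helper, not part of mteb, and the example count is taken from the `n_samples` value quoted later in this thread.

```python
import random


def k_fold_indices(n_examples, n_folds=5, seed=42):
    """Yield (train_idx, eval_idx) pairs that never share an example.

    Hypothetical helper for illustration; not part of mteb.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds nearly equal, disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        eval_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        yield train_idx, eval_idx


# 3594 is the test-split size quoted in this PR.
splits = list(k_fold_indices(n_examples=3594, n_folds=5))
```

Each fold is held out exactly once, so scores are never computed on examples the classifier was fitted on.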

isaac-chung

Nice! Just a few small comments.

mteb/tasks/Audio/AudioClassification/eng/VocalSound.py

Co-authored-by: Isaac Chung <[email protected]>

…-benchmark/mteb into AddDataset/VocalSound

mteb/tasks/Audio/AudioClassification/eng/VocalSound.py

isaac-chung

Good progress so far.

Please remove all other changes related to dict.fromkeys in this PR. (Can be a sep PR to main)
Please run this dataset using one of the existing audio embedding models in the branch to confirm that your changes work

isaac-chung · 2025-04-14T06:07:35Z

docs/create_tasks_table.py

-                table_dict[lang] = {k: 0 for k in sorted(get_args(TASK_TYPE))}
+                table_dict[lang] = dict.fromkeys(sorted(get_args(TASK_TYPE)), 0)


I'd prefer if these changes were reverted. Feel free to open a separate issue, and a PR to main

It seems that this file was formatted with a different ruff version.
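
For context, the two spellings in the diff build the same dictionary, which is why the change is purely stylistic and can live in a separate PR. A small sketch, assuming a hypothetical `task_types` list in place of `get_args(TASK_TYPE)`:

```python
# Hypothetical stand-in for get_args(TASK_TYPE); the real values live in mteb.
task_types = ["Retrieval", "Classification", "Clustering"]

by_comprehension = {k: 0 for k in sorted(task_types)}
by_fromkeys = dict.fromkeys(sorted(task_types), 0)

# Same keys, same values, same insertion order.
same = by_comprehension == by_fromkeys and list(by_comprehension) == list(
    by_fromkeys
)

# Caveat: dict.fromkeys reuses ONE value object for every key. That is
# harmless for an immutable 0, but a mutable default would be shared.
shared = dict.fromkeys(["a", "b"], [])
shared["a"].append(1)
shared_b = shared["b"]  # also sees the appended item
```

The caveat does not apply to the line under review, since the default value there is the integer 0.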

isaac-chung · 2025-04-14T06:09:08Z

mteb/tasks/Audio/AudioClassification/eng/VocalSound.py

+            month=may }
+                }""",
+        descriptive_stats={
+            "n_samples": {"validation": 1860, "test": 3594},


To match eval_splits

Suggested change

-            "n_samples": {"validation": 1860, "test": 3594},
+            "n_samples": {"test": 3594},
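
The consistency rule behind this suggestion can be sketched as a small check. `splits_match` is a hypothetical helper written for illustration, not an mteb function, and `["test"]` is assumed to be the task's `eval_splits` based on the "Change eval_splits to test" commit in this PR.

```python
def splits_match(eval_splits, n_samples):
    """Return True when n_samples describes exactly the evaluated splits.

    Hypothetical helper for illustration; not part of mteb.
    """
    return set(eval_splits) == set(n_samples)


eval_splits = ["test"]  # assumed from the "Change eval_splits to test" commit

before = splits_match(eval_splits, {"validation": 1860, "test": 3594})
after = splits_match(eval_splits, {"test": 3594})
```

Here `before` is False because `validation` is reported but never evaluated, while `after` is True.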

Adds VocalSound dataset

93d21c2

mina-parham added the "new dataset" (Issues related to adding a new task or dataset) and "maeb" (Audio extension) labels Mar 12, 2025

mina-parham requested a review from isaac-chung March 12, 2025 03:42

mina-parham self-assigned this Mar 12, 2025

Samoed reviewed Mar 12, 2025

isaac-chung reviewed Mar 12, 2025

mina-parham and others added 5 commits March 12, 2025 20:48

Update n_samples

b233801

Co-authored-by: Isaac Chung <[email protected]>

Change eval_splits to test

a31e4a4

Co-authored-by: Isaac Chung <[email protected]>

Merge branch 'maeb' into AddDataset/VocalSound

52c977b

Remove test split

3066d42

Merge branch 'AddDataset/VocalSound' of https://github.com/embeddings…

a7aef07

…-benchmark/mteb into AddDataset/VocalSound

mina-parham requested review from isaac-chung and Samoed March 26, 2025 05:09

make lint

71d18ea

Samoed reviewed Mar 26, 2025

Samoed linked an issue Mar 26, 2025 that may be closed by this pull request

Add VocalSound #2313

Open

isaac-chung reviewed Apr 14, 2025


