Adds VocalSound dataset #2337
base: maeb
def dataset_transform(self):
    self.dataset["train"] = self.dataset["test"]
I think it would be better to use cross-validation instead of reusing the test split for training.
The test split should be removed, because for now training and testing would both run on the same test split.
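To illustrate the cross-validation suggestion, here is a minimal k-fold sketch that keeps each held-out fold disjoint from its training indices. The helper name `k_fold_indices` and its parameters are hypothetical and not part of mteb; the folds are not stratified by label, which a real evaluation would probably want.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only; not an mteb API.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at offset i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        # Training indices are everything outside the held-out fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, n_folds=5))
```

Each sample lands in exactly one held-out fold, so no example is used for both fitting and scoring within the same fold.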
Nice! Just a few small comments.
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
…-benchmark/mteb into AddDataset/VocalSound
Good progress so far.
- Please remove all other changes related to dict.fromkeys in this PR. (Can be a separate PR to main.)
- Please run this dataset using one of the existing audio embedding models in the branch to confirm that your changes work.
- table_dict[lang] = {k: 0 for k in sorted(get_args(TASK_TYPE))}
+ table_dict[lang] = dict.fromkeys(sorted(get_args(TASK_TYPE)), 0)
I'd prefer if these changes were reverted. Feel free to open a separate issue and a PR to main.
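For context on why this change is purely stylistic: with an immutable default such as 0, the comprehension and dict.fromkeys build the same mapping in the same key order. A small sketch, where `TASK_TYPE` is a hypothetical stand-in for the project's own alias:

```python
from typing import Literal, get_args

# Hypothetical stand-in for the project's TASK_TYPE alias (assumption).
TASK_TYPE = Literal["Classification", "Clustering", "Retrieval"]

# Old form: explicit dict comprehension.
by_comprehension = {k: 0 for k in sorted(get_args(TASK_TYPE))}
# New form: dict.fromkeys with a shared immutable default.
by_fromkeys = dict.fromkeys(sorted(get_args(TASK_TYPE)), 0)

# Same keys, same values, same insertion order.
same_mapping = by_comprehension == by_fromkeys
same_order = list(by_comprehension) == list(by_fromkeys)
```

The two forms would only diverge with a mutable default (for example a list), since dict.fromkeys shares one object across all keys.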
It seems that this file was formatted with a different ruff version.
month=may }
}""",
descriptive_stats={
    "n_samples": {"validation": 1860, "test": 3594},
To match eval_splits:
- "n_samples": {"validation": 1860, "test": 3594},
+ "n_samples": {"test": 3594},
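A quick way to catch this kind of mismatch is to compare the split names on both sides. The sketch below assumes the task evaluates only on "test", as the suggestion implies; `check_n_samples` is a hypothetical helper, not something mteb provides.

```python
def check_n_samples(eval_splits, descriptive_stats):
    """Return split names present on only one side (hypothetical helper)."""
    stats_splits = set(descriptive_stats["n_samples"])
    # Symmetric difference: splits listed in one place but not the other.
    return sorted(set(eval_splits) ^ stats_splits)


# Values taken from the suggestion above; eval_splits=["test"] is an assumption.
before = check_n_samples(["test"], {"n_samples": {"validation": 1860, "test": 3594}})
after = check_n_samples(["test"], {"n_samples": {"test": 3594}})
```

An empty result means the descriptive stats describe exactly the splits that are evaluated.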
Closes #2313
Code Quality
- Run make lint to maintain consistent style.
Documentation
Testing
- make test-with-coverage.
- Run make test or make test-with-coverage to ensure no existing functionality is broken.
Adding datasets checklist
Reason for dataset addition: ...
- Run models with the mteb -m {model_name} -t {task_name} command:
  - sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  - intfloat/multilingual-e5-small
- self.stratified_subsampling() under dataset_transform()
- make test.
- make lint.
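The checklist mentions stratified subsampling inside dataset_transform(). As a rough idea of what that means, here is a standalone sketch that draws a subsample while roughly preserving label proportions. It is not mteb's self.stratified_subsampling() implementation, whose signature is not shown in this PR; the function name, parameters, and label values are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of a subsample that roughly keeps label proportions.

    Illustrative only; not the mteb method referenced in the checklist.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    total = len(labels)
    chosen = []
    for label in sorted(by_label):
        idxs = by_label[label]
        # Allocate slots proportionally, keeping at least one per label.
        k = max(1, round(n_samples * len(idxs) / total))
        chosen.extend(rng.sample(idxs, min(k, len(idxs))))
    return sorted(chosen)


# Made-up label distribution for illustration.
labels = ["cough"] * 60 + ["laugh"] * 30 + ["sneeze"] * 10
subset = stratified_subsample(labels, n_samples=20)
```

Because each label's share is rounded independently, the total can differ slightly from n_samples for other distributions.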