Audio Retrieval Dataset: JLCorpus #2927
base: maeb
Conversation
```python
class JLCorpusA2TRetrieval(AbsTaskAny2AnyRetrieval):
    metadata = TaskMetadata(
        name="JLCorpusA2TRetrieval",
        description=(
```
So from a speech segment we retrieve the emotion (e.g. the text label "angry")?
This dataset contains transcriptions that have been spoken with different emotions. Presumably, we should retrieve the transcription invariant to emotion; maybe that can be the evaluation. The only problem would be if evaluation is done on ids rather than on unique "text" values.
Processed Dataset: https://huggingface.co/datasets/mteb/JL-Corpus_a2t
Original Dataset: https://huggingface.co/datasets/CLAPv2/JL-Corpus
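A quick way to see how much this matters is to count duplicate transcriptions in the processed dataset. A minimal sketch, assuming the default config has a "text" column and a "test" split (both assumptions about mteb/JL-Corpus_a2t, not confirmed):

```python
# Hedged sketch: count how often the same transcription appears under
# different emotions. The split name and "text" column are assumptions.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("mteb/JL-Corpus_a2t", split="test")
counts = Counter(ds["text"])
print(f"{len(ds)} rows, {len(counts)} unique transcriptions")
# If unique < rows, scoring by row id would penalize retrieving an
# equally correct duplicate of the gold text.
```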
Right, so from the n versions of a text, it should retrieve the one that best matches the emotion. However, doesn't the task retrieve over all possible pairs?
In the way it is currently done (the text column has no emotion info), I figured we should just see whether audio segments can consistently retrieve the correct text, even though the audio has varying emotions associated with it.
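One way to make that evaluation robust to duplicates would be to mark every corpus entry sharing the query's transcription as relevant, so retrieval is scored invariant to emotion. A rough sketch, assuming corpus and query entries are (id, text) pairs; the field names are illustrative, not the actual task schema:

```python
# Hedged sketch: treat every copy of the same transcription as relevant,
# so a hit on any emotion variant of the gold text counts. (id, text)
# pairs are an illustrative assumption, not the real mteb data format.
from collections import defaultdict

def text_invariant_qrels(corpus, queries):
    """corpus, queries: iterables of (id, text) pairs."""
    docs_by_text = defaultdict(list)
    for doc_id, text in corpus:
        docs_by_text[text].append(doc_id)
    return {
        qid: {doc_id: 1 for doc_id in docs_by_text[text]}
        for qid, text in queries
    }
```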
Sounds like something that is easier to test prior to merging. From the results, it seems they can't.
Results on laion/clap-htsat-fused:
- JLCorpusT2ARetrieval.json
- JLCorpusA2TRetrieval.json
P.S. This task could be problematic if evaluation is done on uniqueness of query-id rather than on the actual query text.
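If that turns out to be the case, deduplicating queries by text before scoring would sidestep it. A sketch under the same (id, text) assumption as above:

```python
# Hedged sketch: keep one query id per unique text, so duplicated
# transcriptions are not double-counted. (id, text) pairs are an
# illustrative assumption about the query format.
def dedupe_queries(queries):
    first_id_by_text = {}
    for qid, text in queries:
        first_id_by_text.setdefault(text, qid)
    return [(qid, text) for text, qid in first_id_by_text.items()]
```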