Audio Retrieval Dataset: JLCorpus #2927
base: maeb
Conversation
```python
class JLCorpusA2TRetrieval(AbsTaskAny2AnyRetrieval):
    metadata = TaskMetadata(
        name="JLCorpusA2TRetrieval",
        description=(
```
So from a speech segment we retrieve the emotion (e.g. the text label "angry")?
This dataset contains transcriptions that have been spoken with different emotions. Presumably, we should retrieve the transcription invariant to emotion; maybe that can be the evaluation. The only problem would be if evaluation is done on ids rather than on unique "text" values.
Processed Dataset: https://huggingface.co/datasets/mteb/JL-Corpus_a2t
Original Dataset: https://huggingface.co/datasets/CLAPv2/JL-Corpus
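A quick way to see how much this matters is to count duplicate transcriptions in the processed dataset. A minimal sketch, assuming the default config has a "text" column and a "test" split (both assumptions about mteb/JL-Corpus_a2t, not confirmed):

```python
# Hedged sketch: count how often the same transcription appears under
# different emotions. The split name and "text" column are assumptions.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("mteb/JL-Corpus_a2t", split="test")
counts = Counter(ds["text"])
print(f"{len(ds)} rows, {len(counts)} unique transcriptions")
# If unique < rows, scoring by row id would penalize retrieving an
# equally correct duplicate of the gold text.
```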
Right, so from the n versions of a text, it should retrieve the one that best matches the emotion. However, doesn't the task retrieve over all possible pairs?
In the way it is currently done (the text column has no emotion info), I figured we should just see whether audio segments can consistently retrieve the correct text, even though the audio has varying emotions associated with it.
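One way to make that evaluation robust to duplicates would be to mark every corpus entry sharing the query's transcription as relevant, so retrieval is scored invariant to emotion. A rough sketch, assuming corpus and query entries are (id, text) pairs; the field names are illustrative, not the actual task schema:

```python
# Hedged sketch: treat every copy of the same transcription as relevant,
# so a hit on any emotion variant of the gold text counts. (id, text)
# pairs are an illustrative assumption, not the real mteb data format.
from collections import defaultdict

def text_invariant_qrels(corpus, queries):
    """corpus, queries: iterables of (id, text) pairs."""
    docs_by_text = defaultdict(list)
    for doc_id, text in corpus:
        docs_by_text[text].append(doc_id)
    return {
        qid: {doc_id: 1 for doc_id in docs_by_text[text]}
        for qid, text in queries
    }
```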
Sounds like something that is easier to test prior to merging. From the results, it seems they can't.
Results on laion/clap-htsat-fused:
- JLCorpusT2ARetrieval.json
- JLCorpusA2TRetrieval.json
P.S. This task could be problematic if evaluation is done on uniqueness of query-id rather than on the actual query text.
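If that turns out to be the case, deduplicating queries by text before scoring would sidestep it. A sketch under the same (id, text) assumption as above:

```python
# Hedged sketch: keep one query id per unique text, so duplicated
# transcriptions are not double-counted. (id, text) pairs are an
# illustrative assumption about the query format.
def dedupe_queries(queries):
    first_id_by_text = {}
    for qid, text in queries:
        first_id_by_text.setdefault(text, qid)
    return [(qid, text) for text, qid in first_id_by_text.items()]
```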