
[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset #4199

Open
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from AdnanElAssadi56:mveb-video-integration

Conversation


AdnanElAssadi56 (Contributor) commented Mar 5, 2026

(From closed PR)

Adds the following:
mteb/kinetics-400
mteb/RAVDESS_AV
PE-AV (Facebook). Closes #3797

Also includes some remaining components from the parallel video integration work we accidentally did.

Samoed added labels on Mar 5, 2026: new model (Questions related to adding a new model to the benchmark), new dataset (Issues related to adding a new task or dataset), video (video extension).
Comment on lines +198 to +202
modality_to_column = {
"video": "video",
"audio": "audio",
"image": "image",
}
Member

Can we just extend input_column_name to a list?

Contributor Author

This would also require changes in the dataloader; we can do that separately.

Member

What changes? I think it's easier to use a list for processing rather than handling it like this.

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung Changed input_column to list.

@AdnanElAssadi56
Contributor Author

Lint is giving an error because the list default is mutable.

@isaac-chung
Collaborator

It's looking for something like this I think:

from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
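For context, the lint complaint is about mutable class-level defaults (e.g. ruff's RUF012): a bare list on the class is shared by every instance, while ClassVar declares that sharing as intentional. A minimal standalone sketch (class names here are hypothetical, not MTEB's):

```python
from typing import ClassVar

class TaskA:
    # Mutable class attribute: shared by every instance, so an in-place
    # mutation through one instance leaks into all the others.
    input_column_name = ["video", "audio"]

a, b = TaskA(), TaskA()
a.input_column_name.append("text")
print(b.input_column_name)  # ['video', 'audio', 'text'] -- the append leaked

class TaskB:
    # ClassVar marks the list as an intentional class-level constant,
    # which satisfies linters such as ruff's RUF012.
    input_column_name: ClassVar[list[str]] = ["video", "audio"]
```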

Comment on lines +622 to +630
if isinstance(input_column, str):
text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
text_data = dataset["text"]
else:
raise ValueError(
"Cannot determine which column to use for text evaluation. "
"Please include 'text' in input_column_name or use a single string."
)
Member

Why is this needed?

Contributor Author

When the text-evaluator needs to pull text data, this ensures it selects the "text" column from that input column list rather than crashing or trying to embed an audio column as text.
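Concretely, the quoted branch resolves the text column like this (a standalone sketch of the same logic; the helper name is hypothetical):

```python
def resolve_text_column(input_column, dataset_columns):
    """Pick the column to embed as text.

    `input_column` is either a single column name or a list such as
    ["video", "audio", "text"]; with a list, only an explicit "text"
    entry that also exists in the dataset is used.
    """
    if isinstance(input_column, str):
        return input_column
    if "text" in input_column and "text" in dataset_columns:
        return "text"
    raise ValueError(
        "Cannot determine which column to use for text evaluation. "
        "Please include 'text' in input_column_name or use a single string."
    )
```

With `["video", "audio"]` and no "text" entry this raises instead of silently embedding an audio column as text.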

@AdnanElAssadi56
Contributor Author

@Samoed Can you give a look here when you have the time?

  train_split: str = "train"
  label_column_name: str = "label"
- input_column_name: str = "text"
+ input_column_name: str | Sequence[str] = "text"
Member

Can we roll this back, since we don't use multiple columns? I think you can roll back almost all changes in abstasks.

Collaborator

Don't we still have multiple inputs used in these tasks? I find the current _combine_modalities(example) in every task way too bulky.

Member
@Samoed Mar 13, 2026

For now, we don't have such tasks. I think it's better to combine video in the task code itself, to keep the ability to separate a video's audio from standalone audio. We discussed this in #4148 (comment). The current implementation is just a hack around torchcodec/datasets.

Collaborator

Video is a collection of frames, without audio. Both tasks in this PR use audio and video; that's multiple columns. The linked comment mentions preview and doesn't seem relevant to the point of the change: to clean up this hack of combining modalities in the task and handle it in AbsTask instead.

Member
@Samoed Mar 13, 2026

handle this in AbsTask instead

I don't think it's a good idea to have modality-specific code in AbsTask. Also, it's not clear when a task needs combination and when it doesn't.

The linked comment mentions preview

Yes, this is part of the discussion. The main discussion of the input format was in the Slack channel.

Member

For now, we don't have such tasks

Collaborator

What's your goal here? I still don't quite understand. The two tasks have video and audio modalities, and it doesn't seem like you agree.

Collaborator

Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio etc. Here, the two tasks have video and audio, and handling it in abstask is cleaner than repeating the code in each task.

Member

What's your goal here? I still don't quite understand.

We need a format flexible enough to support any combination of modalities. If we treated video as just frames and audio independently, it would be hard to distinguish tasks that have both video and audio (independent from the video) inputs. We could of course use names like video and audio_video and process this in AbsTask automatically, but I don't think the dataset should be changed during evaluation.

Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio etc

I know. But since we don't have such tasks, we can remove this for now.

Here, the two tasks have video and audio, and handling it in abstask is cleaner than repeating the code in each task

Tasks should have their dataset in our expected format after load_data.

Collaborator

would be hard to distinguish tasks that have both video and audio (independent from video) inputs. We can of course name like video and audio_video

We don't have such tasks right now, so we don't need it, following your reasoning.

Tasks should have dataset in our expected format after load_data

They do. They should either be re-uploaded or handled in AbsTask when the loading code repeats itself, as it does now.

Comment on lines +853 to +856
if "video" in inputs[0]:
return self.video_collator(inputs)
if "audio" in inputs[0]:
return self.audio_collator(inputs)
Member

What if video and audio are in one task?
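The concern can be seen with a toy batch: a first-match dispatch like the quoted one returns from the video branch and silently drops the audio column. A minimal sketch (the collators here are stand-ins, not the PR's implementations):

```python
def video_collator(inputs):
    # Stand-in: gather the raw video values into a batch dict.
    return {"video": [sample["video"] for sample in inputs]}

def audio_collator(inputs):
    # Stand-in: gather the raw audio values into a batch dict.
    return {"audio": [sample["audio"] for sample in inputs]}

def collate(inputs):
    # Mirrors the quoted dispatch: the first matching modality wins.
    if "video" in inputs[0]:
        return video_collator(inputs)
    if "audio" in inputs[0]:
        return audio_collator(inputs)
    raise ValueError("no known modality in batch")

batch = [{"video": "frames0", "audio": "waveform0"}]
print(collate(batch))  # {'video': ['frames0']} -- the audio is gone
```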

Collaborator

isaac-chung commented Mar 13, 2026

How do you tell VA2C and V2C tasks apart? Is it that only in VA2C tasks we process the audio, regardless of whether it's from the video or in a separate column?

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung @KennethEnevoldsen
This is somewhat of a blocker right now. Can we discuss the approach here if you are available?

@isaac-chung
Collaborator

One main thing we should clarify is how to handle video with and without audio + separate audio

@AdnanElAssadi56
Contributor Author

One main thing we should clarify is how to handle video with and without audio + separate audio

My thought is to not include cases for separate audio for now. If they arise later, we can handle them via metadata or a column in the HF dataset.

@isaac-chung
Collaborator

Okay so how do you tell VA2C and V2C tasks apart?

Contributor Author

AdnanElAssadi56 commented Mar 17, 2026

Okay so how do you tell VA2C and V2C tasks apart?

Yes, I raised this earlier with @Samoed. I still prefer the modality list approach, but it seemed like @KennethEnevoldsen and @Samoed wanted to bundle audio into the video object. The distinction would basically have to be captured via something like [text, video] vs [text, video, audio] in task modalities.

Would it be okay to keep video and audio as separate inputs instead of combining them under video? Or, if we want to combine, is it problematic to do it at the collator level? I think it is cleaner than what we have.

P.S. When I said separate audio above, I meant audio from another video.
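A collator-level combination as proposed might look like the sketch below (hypothetical names; the per-modality collators are stand-ins): instead of dispatching on the first matching modality, it collates every modality column present in the batch, so video and audio stay together.

```python
def combined_collate(inputs, modality_collators):
    """Collate each modality column present in the batch instead of
    dispatching on the first match."""
    batch = {}
    for modality, collator in modality_collators.items():
        if modality in inputs[0]:
            batch[modality] = collator([sample[modality] for sample in inputs])
    return batch

# Stand-in per-modality collators that just stack the raw values:
collators = {"video": list, "audio": list, "text": list}
out = combined_collate([{"video": "v0", "audio": "a0"}], collators)
print(out)  # {'video': ['v0'], 'audio': ['a0']}
```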



Development

Successfully merging this pull request may close these issues.

Add model: PE-AV

4 participants