
[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset #4199

Open
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from AdnanElAssadi56:mveb-video-integration

Conversation


AdnanElAssadi56 (Contributor) commented Mar 5, 2026

(From closed PR)

Adds the following:
mteb/kinetics-400
mteb/RAVDESS_AV
PE-AV (Facebook). Closes #3797

Also includes some remaining components from the parallel video integration work we accidentally did.

Samoed added labels on Mar 5, 2026: new model (Questions related to adding a new model to the benchmark), new dataset (Issues related to adding a new task or dataset), video (video extension).
Comment on lines +198 to +202
modality_to_column = {
"video": "video",
"audio": "audio",
"image": "image",
}
Member

Can we just extend input_column_name to a list?

Contributor Author

This would also require changes in the dataloader; we can do that separately.

Member

What changes? I think it's easier to use a list for processing rather than handling it like this.

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung Changed input_column to list.

@AdnanElAssadi56
Contributor Author

Lint is giving an error because the list default is mutable.

@isaac-chung
Collaborator

It's looking for something like this I think:

from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
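For context, the lint complaint is about mutable class-level defaults (e.g. ruff's RUF012): a bare list on the class is shared by every instance, while ClassVar declares that sharing as intentional. A minimal standalone sketch (class names here are hypothetical, not MTEB's):

```python
from typing import ClassVar

class TaskA:
    # Mutable class attribute: shared by every instance, so an in-place
    # mutation through one instance leaks into all the others.
    input_column_name = ["video", "audio"]

a, b = TaskA(), TaskA()
a.input_column_name.append("text")
print(b.input_column_name)  # ['video', 'audio', 'text'] -- the append leaked

class TaskB:
    # ClassVar marks the list as an intentional class-level constant,
    # which satisfies linters such as ruff's RUF012.
    input_column_name: ClassVar[list[str]] = ["video", "audio"]
```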

Comment on lines +622 to +630
if isinstance(input_column, str):
text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
text_data = dataset["text"]
else:
raise ValueError(
"Cannot determine which column to use for text evaluation. "
"Please include 'text' in input_column_name or use a single string."
)
Member

Why is this needed?

Contributor Author

When the text-evaluator needs to pull text data, this ensures it selects the "text" column from that input column list rather than crashing or trying to embed an audio column as text.
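Concretely, the quoted branch resolves the text column like this (a standalone sketch of the same logic; the helper name is hypothetical):

```python
def resolve_text_column(input_column, dataset_columns):
    """Pick the column to embed as text.

    `input_column` is either a single column name or a list such as
    ["video", "audio", "text"]; with a list, only an explicit "text"
    entry that also exists in the dataset is used.
    """
    if isinstance(input_column, str):
        return input_column
    if "text" in input_column and "text" in dataset_columns:
        return "text"
    raise ValueError(
        "Cannot determine which column to use for text evaluation. "
        "Please include 'text' in input_column_name or use a single string."
    )
```

With `["video", "audio"]` and no "text" entry this raises instead of silently embedding an audio column as text.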

@AdnanElAssadi56
Contributor Author

@Samoed Can you give a look here when you have the time?

  train_split: str = "train"
  label_column_name: str = "label"
- input_column_name: str = "text"
+ input_column_name: str | Sequence[str] = "text"
Member

Can we roll this back, since we don't use multiple columns? I think you can roll back almost all changes in abstasks.

Collaborator

Don't we still have multiple inputs used in these tasks? I find the current _combine_modalities(example) in every task way too bulky.

Member
@Samoed Mar 13, 2026

For now, we don't have such tasks. I think it's better to combine video in the task code itself, to keep the ability to separate a video's audio from standalone audio. We discussed this in #4148 (comment). The current implementation is just a hack around torchcodec/datasets.

Collaborator

Video is a collection of frames, without audio. Both tasks in this PR use audio and video; that's multiple columns. The linked comment mentions preview and doesn't seem relevant to the point of the change: to clean up this hack of combining modalities in the task and handle it in AbsTask instead.

Member
@Samoed Mar 13, 2026

handle this in AbsTask instead

I don't think it's a good idea to have modality-specific code in AbsTask. Also, it's not clear when a task needs combination and when it doesn't.

The linked comment mentions preview

Yes, this is part of the discussion. The main discussion of the input format was in the Slack channel.

Member

For now, we don't have such tasks

Collaborator

What's your goal here? I still don't quite understand. The two tasks have video and audio modalities, and it doesn't seem like you agree.

Collaborator

Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio etc. Here, the two tasks have video and audio, and handling it in abstask is cleaner than repeating the code in each task.

Member

What's your goal here? I still don't quite understand.

We need a format flexible enough to support any combination of modalities. If we treated video as just frames and audio independently, it would be hard to distinguish tasks that have both video and audio (independent from the video) inputs. We could of course use names like video and audio_video and process this in AbsTask automatically, but I don't think the dataset should be changed during evaluation.

Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio etc

I know. But since we don't have such tasks, we can remove this for now.

Here, the two tasks have video and audio, and handling it in abstask is cleaner than repeating the code in each task

Tasks should have their dataset in our expected format after load_data.

Collaborator

would be hard to distinguish tasks that have both video and audio (independent from video) inputs. We can of course name like video and audio_video

We don't have such tasks right now, so we don't need it, following your reasoning.

Tasks should have dataset in our expected format after load_data

They do. They should either be re-uploaded or handled in AbsTask when the loading code repeats itself, as it does now.

Comment on lines +853 to +856
if "video" in inputs[0]:
return self.video_collator(inputs)
if "audio" in inputs[0]:
return self.audio_collator(inputs)
Member

What if video and audio are in one task?
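The concern can be seen with a toy batch: a first-match dispatch like the quoted one returns from the video branch and silently drops the audio column. A minimal sketch (the collators here are stand-ins, not the PR's implementations):

```python
def video_collator(inputs):
    # Stand-in: gather the raw video values into a batch dict.
    return {"video": [sample["video"] for sample in inputs]}

def audio_collator(inputs):
    # Stand-in: gather the raw audio values into a batch dict.
    return {"audio": [sample["audio"] for sample in inputs]}

def collate(inputs):
    # Mirrors the quoted dispatch: the first matching modality wins.
    if "video" in inputs[0]:
        return video_collator(inputs)
    if "audio" in inputs[0]:
        return audio_collator(inputs)
    raise ValueError("no known modality in batch")

batch = [{"video": "frames0", "audio": "waveform0"}]
print(collate(batch))  # {'video': ['frames0']} -- the audio is gone
```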

Collaborator

isaac-chung commented Mar 13, 2026

How do you tell VA2C and V2C tasks apart? Is it that only in VA2C tasks we process the audio, regardless of whether it's from the video or in a separate column?

@AdnanElAssadi56
Contributor Author

@Samoed @isaac-chung @KennethEnevoldsen
This is somewhat of a blocker right now. Can we discuss the approach here if you are available?

@isaac-chung
Collaborator

One main thing we should clarify is how to handle video with and without audio + separate audio

@AdnanElAssadi56
Contributor Author

One main thing we should clarify is how to handle video with and without audio + separate audio

My thought is to not include cases for separate audio for now. If they arise later, we can handle them via metadata or a column in the HF dataset.

@isaac-chung
Collaborator

Okay so how do you tell VA2C and V2C tasks apart?

Contributor Author

AdnanElAssadi56 commented Mar 17, 2026

Okay so how do you tell VA2C and V2C tasks apart?

Yes, I raised this earlier with @Samoed. I still prefer the modality list approach, but it seemed like @KennethEnevoldsen and @Samoed wanted to bundle audio into the video object. The distinction would basically have to be captured via something like [text, video] vs [text, video, audio] in task modalities.

Would it be okay to keep video and audio as separate inputs instead of combining them under video? Or, if we want to combine, is it problematic to do it at the collator level? I think it is cleaner than what we have.

P.S. When I said separate audio above, I meant audio from another video.
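A collator-level combination as proposed might look like the sketch below (hypothetical names; the per-modality collators are stand-ins): instead of dispatching on the first matching modality, it collates every modality column present in the batch, so video and audio stay together.

```python
def combined_collate(inputs, modality_collators):
    """Collate each modality column present in the batch instead of
    dispatching on the first match."""
    batch = {}
    for modality, collator in modality_collators.items():
        if modality in inputs[0]:
            batch[modality] = collator([sample[modality] for sample in inputs])
    return batch

# Stand-in per-modality collators that just stack the raw values:
collators = {"video": list, "audio": list, "text": list}
out = combined_collate([{"video": "v0", "audio": "a0"}], collators)
print(out)  # {'video': ['v0'], 'audio': ['a0']}
```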



Development

Successfully merging this pull request may close these issues.

Add model: PE-AV

4 participants