[MVEB] PE-AV Model, Kinetics400 Dataset, RavdessAV Dataset #4199
AdnanElAssadi56 wants to merge 54 commits into embeddings-benchmark:main from
Conversation
mteb/abstasks/classification.py
Outdated
```python
modality_to_column = {
    "video": "video",
    "audio": "audio",
    "image": "image",
}
```
Can we just extend `input_column_name` to a list?
This will also cause changes in the dataloader; we can do this separately.
What changes? I think it's easier to use a list for processing rather than handling it like this.
@Samoed @isaac-chung Changed `input_column` to a list.
Lint is giving an error because the list is mutable.
It's looking for something like this, I think:

```python
from typing import ClassVar

input_column_name: ClassVar[list[str]] = ["video", "audio"]
```
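For context on why the linter complains: a bare mutable class attribute is shared between all instances, so a mutation through one instance leaks into the others. A minimal sketch of the problem and the usual `dataclasses` fix (class names here are hypothetical, not mteb's):

```python
from dataclasses import dataclass, field

# A plain mutable class attribute: every instance sees the same list object.
class SharedTask:
    input_column_name = ["video", "audio"]

a, b = SharedTask(), SharedTask()
a.input_column_name.append("text")
print(b.input_column_name)  # the mutation through `a` is visible on `b`

# With a dataclass, a mutable default must go through default_factory,
# which builds a fresh list per instance.
@dataclass
class SafeTask:
    input_column_name: list[str] = field(default_factory=lambda: ["video", "audio"])

c, d = SafeTask(), SafeTask()
c.input_column_name.append("text")
print(d.input_column_name)  # still ["video", "audio"]: `d` is unaffected
```

`ClassVar` is the other escape hatch: it tells the type checker the attribute is class-level and intentionally shared, which silences the mutable-default lint.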
```python
if isinstance(input_column, str):
    text_data = dataset[input_column]
elif "text" in input_column and "text" in dataset.column_names:
    text_data = dataset["text"]
else:
    raise ValueError(
        "Cannot determine which column to use for text evaluation. "
        "Please include 'text' in input_column_name or use a single string."
    )
```
When the text-evaluator needs to pull text data, this ensures it selects the "text" column from that input column list rather than crashing or trying to embed an audio column as text.
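The selection logic can be exercised in isolation with a plain dict standing in for a `datasets.Dataset` (the function name `select_text_column` is illustrative, not mteb's API):

```python
def select_text_column(columns: dict[str, list[str]], input_column):
    """Pick the text column: a single string names it directly; a list of
    columns must contain "text" for text evaluation to proceed."""
    if isinstance(input_column, str):
        return columns[input_column]
    if "text" in input_column and "text" in columns:
        return columns["text"]
    raise ValueError("Cannot determine which column to use for text evaluation.")

data = {"text": ["hello"], "audio": ["a.wav"]}
print(select_text_column(data, "text"))             # single string column
print(select_text_column(data, ["audio", "text"]))  # "text" picked from the list
```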
@Samoed Can you take a look here when you have the time?
```diff
 train_split: str = "train"
 label_column_name: str = "label"
-input_column_name: str = "text"
+input_column_name: str | Sequence[str] = "text"
```
Can we roll this back since we don't use multiple columns? I think you can roll back almost all changes in abstasks.
Don't we still have multiple inputs in these tasks? I find the current `_combine_modalities(example)` in every task way too bulky.
For now, we don't have such tasks. I think it's better to combine video in the task code itself, to keep the ability to separate a video's audio track from standalone audio. We discussed this in #4148 (comment). The current implementation is just a hack around torchcodec/datasets.
Video is a collection of frames, without audio. Both tasks in this PR use audio and video; that's multiple columns. The linked comment mentions preview and doesn't seem relevant to the point of the change: to clean up this hack of combining modalities in the task and handle it in AbsTask instead.
> handle this in AbsTask instead

I don't think it's a good idea to have modality-specific code in AbsTask. Also, it isn't clear when a task needs the combination and when it doesn't.

> The linked comment mentions preview

Yes, that's part of the discussion. The main discussion of the input format was in the Slack channel.
> For now, we don't have such tasks
What's your goal here? I still don't quite understand. The two tasks have video and audio modalities, and it doesn't seem like you agree.
Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio, etc. Here, the two tasks have video and audio, and handling it in AbsTask is cleaner than repeating the code in each task.
> What's your goal here? I don't quite understand still.

We need a format flexible enough to support any combination of modalities. If we treated video as just frames and audio independently, it would be hard to distinguish tasks that have both video and audio (independent from the video) inputs. We could of course name them like `video` and `audio_video` and process this in AbsTask automatically, but I don't think the dataset should be changed during evaluation.

> Input column names being a list has nothing to do with supporting a list of videos or a list of images. It just means inputs have multiple input columns, which can be video and text, or video and audio etc

I know. But if we don't have such tasks for now, we can remove this.

> Here, the two tasks have video and audio, and handling it in abstask is cleaner than repeating the code in each task

Tasks should have the dataset in our expected format after `load_data`.
> would be hard to distinguish tasks that have both video and audio (independent from video) inputs. We can of course name like video and audio_video

We don't have such tasks right now, so we don't need it, following your reasoning.

> Tasks should have dataset in our expected format after load_data

They do. They should either be re-uploaded or handled in AbsTask when the loading code repeats itself, as it does now.
```python
if "video" in inputs[0]:
    return self.video_collator(inputs)
if "audio" in inputs[0]:
    return self.audio_collator(inputs)
```
What if both video and audio are in one task?
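With the early returns above, a batch with both columns would hit only the video branch. One way to dispatch per modality and merge the results (a hypothetical sketch, not the PR's implementation; `collate` and the stand-in collators are illustrative):

```python
def collate(inputs: list[dict], collators: dict):
    """Run each registered collator over its own column and merge the
    per-modality batches, instead of returning on the first match."""
    batch = {}
    for modality, collator in collators.items():
        if modality in inputs[0]:
            batch[modality] = collator([example[modality] for example in inputs])
    return batch

# Stand-in collators; real ones would stack tensors per modality.
collators = {"video": list, "audio": list}
out = collate(
    [{"video": "v0", "audio": "a0"}, {"video": "v1", "audio": "a1"}],
    collators,
)
print(out)  # both modalities collated, not just video
```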
How do you tell VA2C and V2C tasks apart? Is it that only in VA2C tasks we process the audio, regardless of whether it comes from the video or from a separate column?
@Samoed @isaac-chung @KennethEnevoldsen
> One main thing we should clarify is how to handle video with and without audio + separate audio

My thought is to not include cases for separate audio for now. If they arise later, we can handle it in metadata or a column in the HF dataset, or something similar.
Okay, so how do you tell VA2C and V2C tasks apart?
Yes, I raised this earlier with @Samoed. I still prefer the modality-list approach, but it seemed like @KennethEnevoldsen and @Samoed wanted to bundle audio into the video object. The distinction would basically have to be captured via something like [text, video] vs [text, video, audio] in the task modalities. Would it be okay to keep video and audio as separate inputs instead of combining them under video? Or, if we want to combine, is it problematic to do it at the collator level? I think that's cleaner than what we have. P.S. When I said separate audio above, I meant audio from another video.
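The modality-list approach can be stated concretely. A minimal sketch, assuming the distinction lives in a per-task modality list (the names and helper below are illustrative, not mteb's actual metadata schema):

```python
# Hypothetical per-task modality lists: V2C uses only frames, VA2C also
# consumes the audio track. The task type is read off the list, so no
# column needs to be renamed or combined at load time.
V2C_MODALITIES = ["text", "video"]
VA2C_MODALITIES = ["text", "video", "audio"]

def processes_audio(modalities: list[str]) -> bool:
    """VA2C-style tasks are exactly those whose modality list includes audio."""
    return "audio" in modalities

print(processes_audio(V2C_MODALITIES))   # V2C: frames only
print(processes_audio(VA2C_MODALITIES))  # VA2C: frames plus audio
```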
(From closed PR)
Adds the following:
- mteb/kinetics-400
- mteb/RAVDESS_AV
- PE-AV (Facebook)

Close #3797
Also includes some remaining components from the parallel video integration work we accidentally did.