Thank you for your work! I have a question:
In the paper, it is stated: "We also learn a CLIP-style InternVideo2 indicated by InternVideo2clip. It is post-pretrained from InternVideo2s2 by only preserving video and text encoders and contrastive loss."
May I ask what training dataset was used for InternVideo2clip? Does it include any Chinese data? Approximately how much data would be required to fine-tune it effectively?
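For context, my current understanding of the InternVideo2clip post-pretraining is a standard CLIP-style symmetric contrastive (InfoNCE) objective over the preserved video and text encoders. Here is a minimal sketch of that setup, only to confirm my understanding; the projection heads and module names are hypothetical placeholders, not the actual InternVideo2 implementation:

```python
# Sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between pooled
# video and text embeddings. Encoders/projections are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoTextContrastive(nn.Module):
    def __init__(self, embed_dim: int = 512, init_logit_scale: float = 1 / 0.07):
        super().__init__()
        # Hypothetical projection heads sitting on top of the (frozen or tuned)
        # video and text encoders that are preserved for post-pretraining.
        self.video_proj = nn.LazyLinear(embed_dim)
        self.text_proj = nn.LazyLinear(embed_dim)
        # Learnable temperature, stored in log space as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(float(init_logit_scale)).log())

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities so the dot product is cosine similarity.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
        # Symmetric cross-entropy: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Usage with dummy pooled features (batch of 8, arbitrary feature widths).
loss_fn = VideoTextContrastive()
loss = loss_fn(torch.randn(8, 1024), torch.randn(8, 768))
loss.backward()
```

If the actual objective differs from this (e.g. extra losses or a different pooling scheme), please correct me.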
I also noticed that, in the official weights, only the attnpool of the vision encoder has been released: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4