Thank you for your work! I have a question:
In the paper, it is stated: "We also learn a CLIP-style InternVideo2 indicated by InternVideo2clip. It is post-pretrained from InternVideo2s2 by only preserving video and text encoders and contrastive loss."
May I ask what training dataset was used for InternVideo2clip? Does it include any Chinese data? Approximately how much data would be required to fine-tune it effectively?
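For context, my current understanding of the InternVideo2clip post-pretraining is a standard CLIP-style symmetric contrastive (InfoNCE) objective over the preserved video and text encoders. Here is a minimal sketch of that setup, only to confirm my understanding; the projection heads and module names are hypothetical placeholders, not the actual InternVideo2 implementation:

```python
# Sketch of a CLIP-style symmetric contrastive (InfoNCE) loss between pooled
# video and text embeddings. Encoders/projections are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoTextContrastive(nn.Module):
    def __init__(self, embed_dim: int = 512, init_logit_scale: float = 1 / 0.07):
        super().__init__()
        # Hypothetical projection heads sitting on top of the (frozen or tuned)
        # video and text encoders that are preserved for post-pretraining.
        self.video_proj = nn.LazyLinear(embed_dim)
        self.text_proj = nn.LazyLinear(embed_dim)
        # Learnable temperature, stored in log space as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(float(init_logit_scale)).log())

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize both modalities so the dot product is cosine similarity.
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = self.logit_scale.exp() * v @ t.t()          # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
        # Symmetric cross-entropy: video-to-text and text-to-video directions.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Usage with dummy pooled features (batch of 8, arbitrary feature widths).
loss_fn = VideoTextContrastive()
loss = loss_fn(torch.randn(8, 1024), torch.randn(8, 768))
loss.backward()
```

If the actual objective differs from this (e.g. extra losses or a different pooling scheme), please correct me.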
I also noticed that, in the official weights, only the attnpool of the vision encoder has been released: https://huggingface.co/OpenGVLab/InternVideo2-Stage2_1B-224p-f4