STM

[CVPR 2025] This is the official repository for our paper: "Video Language Model Pretraining with Spatio-temporal Masking".

By Yue Wu, Zhaobo Qi, Junshu Sun, Yaowei Wang, Qingming Huang, and Shuhui Wang.

Introduction

The development of self-supervised video-language models based on mask learning has significantly advanced downstream video tasks. These models leverage masked reconstruction to facilitate joint learning of visual and linguistic information. However, a recent study reveals that reconstructing image features yields superior downstream performance compared to reconstructing video features. We hypothesize that this performance gap stems from how masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we conducted two sets of experiments, which demonstrate that alignment between the masked target and the reconstruction target is crucial for self-supervised video-language learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. By combining the masking strategy with the reconstruction decoder, STM forces the model to learn comprehensive spatio-temporal feature representations. Experiments on three downstream video understanding tasks validate the superiority of our method.
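For intuition, below is a minimal, hypothetical PyTorch sketch of the two ideas above, not the code in this repository: a spatial mask shared across each pair of adjacent frames, and a decoder whose masked-token queries cross-attend to text features. All names, shapes, and hyperparameters are illustrative assumptions.

import torch
import torch.nn as nn

def adjacent_frame_mask(batch_size, num_frames, num_patches, mask_ratio=0.5):
    # Sketch: sample one spatial mask per pair of adjacent frames and
    # share it between both frames, so masked regions persist over time
    # and cannot be trivially copied from a neighboring frame.
    assert num_frames % 2 == 0, "sketch assumes an even frame count"
    num_pairs = num_frames // 2
    num_masked = int(num_patches * mask_ratio)
    scores = torch.rand(batch_size, num_pairs, num_patches)
    ids = scores.argsort(dim=-1)[..., :num_masked]
    mask = torch.zeros(batch_size, num_pairs, num_patches)
    mask.scatter_(-1, ids, 1.0)
    # Repeat each pair's mask for both of its frames: (B, T, N) boolean.
    return mask.repeat_interleave(2, dim=1).bool()

class SemanticDecoder(nn.Module):
    # Sketch: masked-token queries cross-attend to text features, so the
    # reconstruction of masked patches is conditioned on caption semantics.
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, masked_tokens, text_feats):
        attended, _ = self.cross_attn(masked_tokens, text_feats, text_feats)
        return masked_tokens + attended

mask = adjacent_frame_mask(batch_size=2, num_frames=8, num_patches=196)
print(mask.shape)  # torch.Size([2, 8, 196])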


Installation

Environment Setup

We mainly follow VINDLU to prepare the environment.

# create the conda environment
conda env create -f vl.yml
# activate it
conda activate vl

Data Preparation

We mainly follow UMT to prepare the data.

Run

Pre-Training

Train STM:

sh ./exp/pretraining/b16_5m.sh

Zero-shot Validation

Validate with the pretrained model:

bash ./exp/zero_shot/ret_msrvtt/b16_25m.sh

Finetuning

Finetune from the pretrained model:

sh ./exp/finetuning/ret_msrvtt/b16_5m.sh

Cite

If you use our dataset or method in your research, please cite our paper:

@inproceedings{wu2025video,
  title={Video Language Model Pretraining with Spatio-temporal Masking},
  author={Wu, Yue and Qi, Zhaobo and Sun, Junshu and Wang, Yaowei and Huang, Qingming and Wang, Shuhui},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={8557--8567},
  year={2025}
}

Acknowledgement

This repository is built upon the UMT repository.
