STM

[CVPR 2025] This is the official repository for our paper: "Video Language Model Pretraining with Spatio-temporal Masking".

By Yue Wu, Zhaobo Qi, Junshu Sun, Yaowei Wang, Qingming Huang and Shuhui Wang.

Introduction

The development of self-supervised video-language models based on masked learning has significantly advanced downstream video tasks. These models use masked reconstruction to learn visual and linguistic information jointly. However, a recent study reveals that reconstructing image features yields better downstream performance than reconstructing video features. We hypothesize that this performance gap stems from how masking strategies influence the model's attention to temporal dynamics. To validate this hypothesis, we performed two sets of experiments, which demonstrate that alignment between the masked target and the reconstruction target is crucial for self-supervised video-language learning. Based on these findings, we propose a spatio-temporal masking strategy (STM) for video-language model pretraining that operates across adjacent frames, together with a decoder that leverages semantic information to enhance the spatio-temporal representations of masked tokens. By combining this masking strategy with the reconstruction decoder, STM forces the model to learn spatio-temporal feature representations comprehensively. Experiments on three downstream video understanding tasks validate the superiority of our method.
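To give a rough intuition for masking that "operates across adjacent frames", here is a minimal sketch. It is not the implementation in this repository: the function name `adjacent_frame_mask`, the `group` parameter, and the choice to share one random set of patch positions within each group of adjacent frames are all assumptions made for illustration. The strategy used in the paper may differ.

```python
import numpy as np

def adjacent_frame_mask(num_frames, num_patches, mask_ratio=0.5, group=2, seed=0):
    """Illustrative only (hypothetical helper, not the repository's code).

    Returns a boolean array of shape (num_frames, num_patches), True = masked.
    Frames are split into consecutive groups of `group` adjacent frames, and
    every frame in a group masks the same randomly chosen patch positions.
    """
    rng = np.random.default_rng(seed)
    num_masked = int(round(mask_ratio * num_patches))
    mask = np.zeros((num_frames, num_patches), dtype=bool)
    for start in range(0, num_frames, group):
        # One random set of spatial positions per group of adjacent frames.
        idx = rng.choice(num_patches, size=num_masked, replace=False)
        mask[start:start + group, idx] = True
    return mask

mask = adjacent_frame_mask(num_frames=4, num_patches=16, mask_ratio=0.5)
```

Because the masked positions are shared within a group, a masked patch cannot be recovered by simply copying it from the neighbouring frame, which is one plausible way such a mask could push a model toward temporal reasoning.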

Installation

Environment Setup

We mainly follow VINDLU to prepare the environment.

# create the conda environment from the provided spec
conda env create -f vl.yml
# activate the environment
conda activate vl

Data Preparation

We mainly follow UMT to prepare the data.

Run

Pre-Training

Train STM:

sh ./exp/pretraining/b16_5m.sh

Zero-shot validation

Validate with pretrained model:

bash ./exp/zero_shot/ret_msrvtt/b16_25m.sh

Finetuning

Finetune from the pretrained model:

sh ./exp/finetuning/ret_msrvtt/b16_5m.sh

Cite

If you use our dataset or method in your research, please cite our paper:

@inproceedings{wu2025video,
  title={Video Language Model Pretraining with Spatio-temporal Masking},
  author={Wu, Yue and Qi, Zhaobo and Sun, Junshu and Wang, Yaowei and Huang, Qingming and Wang, Shuhui},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={8557--8567},
  year={2025}
}

Acknowledgement

This repository is built on the UMT repository.
