
VideoITG: Improving Multimodal Video Understanding with Instructed Temporal Grounding



Introduction

VideoITG is an innovative approach to video understanding, designed to enhance the performance of Video Large Language Models (Video-LLMs) through informed frame selection. It tackles the complexities of real-world video scenarios by aligning frame sampling with user instructions. VideoITG employs a comprehensive pipeline that includes detailed clip-level description generation, question-guided clip retrieval, and task-specific frame selection. This results in a robust dataset of 40K videos and 480K annotations. The plug-and-play model leverages visual language alignment and reasoning, achieving superior results across multimodal benchmarks, particularly in tasks requiring precise temporal grounding.
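Because VideoITG is plug-and-play, it sits in front of an existing Video-LLM: it scores candidate frames against the user instruction and passes only the selected frames downstream. The minimal sketch below illustrates that flow conceptually; selector.score and video_llm.generate are hypothetical placeholders, not this repository's actual API.

# Conceptual sketch of instruction-guided frame selection feeding a Video-LLM.
# `selector` and `video_llm` are hypothetical objects, not the real interfaces.

def answer_with_videoitg(frames, instruction, selector, video_llm, k=32):
    """Instructed frame selection followed by Video-LLM inference."""
    scores = selector.score(frames, instruction)                 # relevance of each frame to the instruction
    top_k = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)[:k]
    selected = [frames[i] for i in sorted(top_k)]                # keep the chosen frames in temporal order
    return video_llm.generate(frames=selected, question=instruction)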

Updates

  • [2025/07/25 (ETA)] Code and checkpoint release.
  • [2025/07/18] Technical report release. [arXiv]

Contents

  • Models & Performance
  • Visual Examples
  • Install
  • Training Data
  • Training
  • Evaluation
  • Citation
  • Acknowledgement

Models & Performance

Below are the results of the model trained on our organized 1.8M supervised fine-tuning data.

Model        VideoLLM          Frames  LongVideoBench  MLVU          VideoMME      CG-Bench
VideoITG-7B  InternVL2.5-8B    32      61.9 (+2.9%)    75.0 (+7.8%)  67.3 (+4.0%)  46.7 (+7.0%)
VideoITG-7B  InternVL2.5-26B   32      63.0 (+1.0%)    78.9 (+6.1%)  69.9 (+2.5%)  48.7 (+6.0%)
VideoITG-7B  LLaVA-Video-7B    32      61.6 (+3.6%)    74.6 (+8.6%)  66.1 (+3.0%)  42.8 (+9.0%)
VideoITG-7B  LLaVA-Video-7B    64      60.9 (+7.4%)    76.3 (+7.6%)  66.4 (+1.9%)  42.9 (+8.1%)

Visual Examples



Install

Please follow the guide below to prepare the environment on a Linux OS.

  1. Clone this repository
git clone https://github.com/NVlabs/VideoITG.git
cd VideoITG
  2. Create the conda environment and install packages
conda create -n videoitg python=3.12 -y
conda activate videoitg
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
  3. Install additional packages for training
pip install flash-attn==2.4.2.dev3 --no-build-isolation

Training Data

VideoLLM Data

For VideoLLM training, we use the same data and strategy as LLaVA-Video, including the Pretraining Data, OV SFT Data, and LLaVA-Video-178K Data.

VideoITG Data

Checkpoint Preparation

We recommend using the VideoLLM checkpoints we provided here to reproduce our results.

Training

You can train the model with the following command:

bash scripts/videoitg/finetune-uni-64frame-qwen2-7b-grounding.sh finetune 16

By default we use 128 NVIDIA A100 80GB GPUs for training. Please adjust per_device_train_batch_size and gradient_accumulation_steps if you are using a different number of GPUs. Training VideoITG takes about 4 hours.
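As a rough guide (assuming standard Hugging Face Trainer semantics, where the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs), you can keep the effective batch size constant on a smaller cluster by raising gradient accumulation. The numbers below are illustrative assumptions, not values taken from the training script.

# Illustrative arithmetic only; the per-device batch size and accumulation steps
# here are assumptions, not the repository's actual defaults.
reference_gpus = 128        # default setup described above
per_device_bs = 1           # hypothetical per-device batch size
grad_accum = 1              # hypothetical accumulation steps
effective_bs = per_device_bs * grad_accum * reference_gpus   # 128

available_gpus = 16
needed_accum = effective_bs // (per_device_bs * available_gpus)   # 8
print(f"Set gradient_accumulation_steps to {needed_accum} to keep an effective batch of {effective_bs}.")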

Notes

If you have limited GPU resources or memory, please consider the following:

  • use gradient accumulation and reduce the per-device batch size

Evaluation

Evaluation with LMMs-Eval

For evaluation, we use VideoMME as an example. First, use the following command to run the VideoITG model and obtain the instructed grounding results.

bash scripts/eval_lmms_eval/videomme_grounding.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE

Notes

In our paper, we report the results of CG-Bench mini, which includes 3,000 QA pairs.

Citation

If you find this project useful, please cite our work:

@misc{wang2025videoitgmultimodalvideounderstanding,
      title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding}, 
      author={Shihao Wang and Guo Chen and De-an Huang and Zhiqi Li and Minghan Li and Guilin Li and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
      year={2025},
      eprint={2507.13353},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13353}, 
}

Acknowledgement

  • LLaVA: the codebase we built upon.
  • Eagle: the codebase we built upon. Thanks for the great pioneering open-source project!
  • LMMs-Eval: many thanks to the LMMs-Lab for their wonderful and easy-to-use evaluation tools!
  • LLaVA-Video-178K: we train our model with data from LLaVA-Video-178K.
