VideoITG is an innovative approach to video understanding, designed to enhance the performance of Video Large Language Models (Video-LLMs) through informed frame selection. It tackles the complexities of real-world video scenarios by aligning frame sampling with user instructions. VideoITG employs a comprehensive pipeline that includes detailed clip-level description generation, question-guided clip retrieval, and task-specific frame selection. This results in a robust dataset of 40K videos and 480K annotations. The plug-and-play model leverages visual language alignment and reasoning, achieving superior results across multimodal benchmarks, particularly in tasks requiring precise temporal grounding.
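The plug-and-play idea can be sketched as follows. This is a minimal illustration only, not the released API: `select_informative_frames` and `score_fn` are hypothetical names standing in for the VideoITG grounding model and its scoring step.

```python
# Minimal sketch of the plug-and-play idea (hypothetical interface, not the released API).
from typing import Any, Callable, List, Sequence

Frame = Any  # e.g. a decoded video frame / image tensor

def select_informative_frames(
    frames: Sequence[Frame],
    instruction: str,
    score_fn: Callable[[Sequence[Frame], str], List[float]],
    k: int = 32,
) -> List[Frame]:
    """Score each candidate frame against the user instruction and keep the
    k highest-scoring ones in temporal order; `score_fn` stands in for the
    VideoITG grounding model."""
    scores = score_fn(frames, instruction)
    top_k = sorted(range(len(frames)), key=scores.__getitem__, reverse=True)[:k]
    return [frames[i] for i in sorted(top_k)]

# The selected frames (rather than uniformly sampled ones) are then passed to a
# downstream Video-LLM, e.g. InternVL2.5 or LLaVA-Video, together with the question.
```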
- [2025/07/25 (ETA)] Code and checkpoint release.
- [2025/07/18] Technical report release. [arXiv]
- Models & Performance
- Visual Examples
- Install
- Training Data
- Checkpoint Preparation
- Training
- Evaluation
Below are the results of the model trained on our organized 1.8M supervised fine-tuning data.
Model | VideoLLM | Frames | LongVideoBench | MLVU | VideoMME | CG-Bench |
---|---|---|---|---|---|---|
VideoITG-7B | InternVL2.5-8B | 32 | 61.9 (+2.9%) | 75.0 (+7.8%) | 67.3 (+4.0%) | 46.7 (+7.0%) |
VideoITG-7B | InternVL2.5-26B | 32 | 63.0 (+1.0%) | 78.9 (+6.1%) | 69.9 (+2.5%) | 48.7 (+6.0%) |
VideoITG-7B | LLaVA-Video-7B | 32 | 61.6 (+3.6%) | 74.6 (+8.6%) | 66.1 (+3.0%) | 42.8 (+9.0%) |
VideoITG-7B | LLaVA-Video-7B | 64 | 60.9 (+7.4%) | 76.3 (+7.6%) | 66.4 (+1.9%) | 42.9 (+8.1%) |
Please follow the guide here to prepare the environment on a Linux OS.
- Clone this repository

```bash
git clone https://github.com/NVlabs/VideoITG.git
cd VideoITG
```

- Create the environment and install packages

```bash
conda create -n videoitg python=3.12 -y
conda activate videoitg
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
```

- Install additional packages for training

```bash
pip install flash-attn==2.4.2.dev3 --no-build-isolation
```
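As an optional sanity check (not part of the official instructions), you can verify that PyTorch and FlashAttention import correctly and that a CUDA device is visible:

```python
# Optional sanity check: verify that PyTorch and FlashAttention import
# correctly and that a CUDA-capable GPU is visible.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```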
For VideoLLM training, we use the same data and strategy as LLaVA-Video, including the Pretraining Data, OV SFT Data, and LLaVA-Video-178K Data.
We recommend using the VideoLLM checkpoints we provided here to reproduce our results.
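If the checkpoints are hosted on the Hugging Face Hub, one way to fetch them is with `huggingface_hub`. This is only a sketch: the repo id below is a placeholder, not the actual release name; use the link above for the real location.

```python
# Hypothetical download sketch; the repo id below is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/VideoITG-checkpoint",   # placeholder, replace with the linked repo
    local_dir="checkpoints/videoitg",    # where to store the weights
)
print("Checkpoint files downloaded to:", local_dir)
```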
You can train the model with the following command:

```bash
bash scripts/videoitg/finetune-uni-64frame-qwen2-7b-grounding.sh finetune 16
```
By default, we use 128 NVIDIA A100 80GB GPUs for training. Please adjust `per_device_train_batch_size` and `gradient_accumulation_steps` if you are using a different number of GPUs. Training VideoITG takes about 4 hours.
If you have limited GPU resources or memory, please consider the following:
- use gradient accumulation and reduce the per-device batch size (see the sketch below for keeping the effective batch size constant)
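As a rough guide, keep the product of GPU count, per-device batch size, and gradient accumulation steps constant. The per-device batch size and accumulation defaults below are assumed values for illustration, not the values hard-coded in the training script.

```python
# Worked example (assumed values, not the script's defaults): keep the
# effective batch size constant when training on fewer GPUs.
default_gpus = 128
per_device_train_batch_size = 1      # assumed per-device batch size
gradient_accumulation_steps = 1      # assumed default
effective_bs = default_gpus * per_device_train_batch_size * gradient_accumulation_steps  # 128

my_gpus = 16
# Pick gradient_accumulation_steps so that
# my_gpus * per_device_train_batch_size * gradient_accumulation_steps == effective_bs
gradient_accumulation_steps = effective_bs // (my_gpus * per_device_train_batch_size)
print(gradient_accumulation_steps)   # -> 8
```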
For evaluation, we use Video-MME as an example. First, use the following command to run our VideoITG model and obtain the instructed grounding results.

```bash
bash scripts/eval_lmms_eval/videomme_grounding.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
```
In our paper, we report the results of CG-Bench mini, which includes 3,000 QA pairs.
If you find this project useful, please cite our work:
```bibtex
@misc{wang2025videoitgmultimodalvideounderstanding,
      title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
      author={Shihao Wang and Guo Chen and De-an Huang and Zhiqi Li and Minghan Li and Guilin Li and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
      year={2025},
      eprint={2507.13353},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13353},
}
```
- LLaVA: the codebase we built upon.
- Eagle: another codebase we built upon. Thanks for the great pioneering open-source project!
- LMMs-Eval: many thanks to the LMMs-Lab for their wonderful and easy-to-use evaluation tools!
- LLaVA-Video-178K: we train our model with the data from LLaVA-Video-178k.