VideoITG is an innovative approach to video understanding, designed to enhance the performance of Video Large Language Models (Video-LLMs) through informed frame selection. It tackles the complexities of real-world video scenarios by aligning frame sampling with user instructions. VideoITG employs a comprehensive pipeline that includes detailed clip-level description generation, question-guided clip retrieval, and task-specific frame selection. This results in a robust dataset of 40K videos and 480K annotations. The plug-and-play model leverages visual language alignment and reasoning, achieving superior results across multimodal benchmarks, particularly in tasks requiring precise temporal grounding.
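The plug-and-play idea can be sketched as follows. This is a minimal illustration only, not the released API: `select_informative_frames` and `score_fn` are hypothetical names standing in for the VideoITG grounding model and its scoring step.

```python
# Minimal sketch of the plug-and-play idea (hypothetical interface, not the released API).
from typing import Any, Callable, List, Sequence

Frame = Any  # e.g. a decoded video frame / image tensor

def select_informative_frames(
    frames: Sequence[Frame],
    instruction: str,
    score_fn: Callable[[Sequence[Frame], str], List[float]],
    k: int = 32,
) -> List[Frame]:
    """Score each candidate frame against the user instruction and keep the
    k highest-scoring ones in temporal order; `score_fn` stands in for the
    VideoITG grounding model."""
    scores = score_fn(frames, instruction)
    top_k = sorted(range(len(frames)), key=scores.__getitem__, reverse=True)[:k]
    return [frames[i] for i in sorted(top_k)]

# The selected frames (rather than uniformly sampled ones) are then passed to a
# downstream Video-LLM, e.g. InternVL2.5 or LLaVA-Video, together with the question.
```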
- [2025/07/25 (ETA)] Code and checkpoint release.
- [2025/07/18] Technical report release. [arXiv]
- Models & Performance
- Visual Examples
- Install
- Training Data
- Checkpoint Preparation
- Training
- Evaluation
Below are the results of the model trained on our organized 1.8M supervised fine-tuning data.
Model | VideoLLM | Frames | LongVideoBench | MLVU | VideoMME | CG-Bench |
---|---|---|---|---|---|---|
VideoITG-7B | InternVL2.5-8B | 32 | 61.9 (+2.9%) | 75.0 (+7.8%) | 67.3 (+4.0%) | 46.7 (+7.0%) |
VideoITG-7B | InternVL2.5-26B | 32 | 63.0 (+1.0%) | 78.9 (+6.1%) | 69.9 (+2.5%) | 48.7 (+6.0%) |
VideoITG-7B | LLaVA-Video-7B | 32 | 61.6 (+3.6%) | 74.6 (+8.6%) | 66.1 (+3.0%) | 42.8 (+9.0%) |
VideoITG-7B | LLaVA-Video-7B | 64 | 60.9 (+7.4%) | 76.3 (+7.6%) | 66.4 (+1.9%) | 42.9 (+8.1%) |
Please follow the guide here to prepare the environment on a Linux OS.
- Clone this repository

```bash
git clone https://github.com/NVlabs/VideoITG.git
cd VideoITG
```

- Create the environment and install packages

```bash
conda create -n videoitg python=3.12 -y
conda activate videoitg
pip install --upgrade pip  # enable PEP 660 support
pip install -r requirements.txt
```

- Install additional packages for training

```bash
pip install flash-attn==2.4.2.dev3 --no-build-isolation
```
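As an optional sanity check (not part of the official instructions), you can verify that PyTorch and FlashAttention import correctly and that a CUDA device is visible:

```python
# Optional sanity check: verify that PyTorch and FlashAttention import
# correctly and that a CUDA-capable GPU is visible.
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```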
For VideoLLM training, we use the same data and strategy as LLaVA-Video, including the Pretraining Data, OV SFT Data, and LLaVA-Video-178K Data.
We recommend using the VideoLLM checkpoints we provided here to reproduce our results.
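If the checkpoints are hosted on the Hugging Face Hub, one way to fetch them is with `huggingface_hub`. This is only a sketch: the repo id below is a placeholder, not the actual release name; use the link above for the real location.

```python
# Hypothetical download sketch; the repo id below is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/VideoITG-checkpoint",   # placeholder, replace with the linked repo
    local_dir="checkpoints/videoitg",    # where to store the weights
)
print("Checkpoint files downloaded to:", local_dir)
```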
You can train the model with the following command:

```bash
bash scripts/videoitg/finetune-uni-64frame-qwen2-7b-grounding.sh finetune 16
```
By default, we use 128 NVIDIA A100 80GB GPUs for training. Please adjust `per_device_train_batch_size` and `gradient_accumulation_steps` if you are using a different number of GPUs. Training VideoITG takes about 4 hours.
If you have limited GPU resources or memory, please consider the following:
- use gradient accumulation and reduce the per-device batch size (see the sketch below for keeping the effective batch size constant)
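As a rough guide, keep the product of GPU count, per-device batch size, and gradient accumulation steps constant. The per-device batch size and accumulation defaults below are assumed values for illustration, not the values hard-coded in the training script.

```python
# Worked example (assumed values, not the script's defaults): keep the
# effective batch size constant when training on fewer GPUs.
default_gpus = 128
per_device_train_batch_size = 1      # assumed per-device batch size
gradient_accumulation_steps = 1      # assumed default
effective_bs = default_gpus * per_device_train_batch_size * gradient_accumulation_steps  # 128

my_gpus = 16
# Pick gradient_accumulation_steps so that
# my_gpus * per_device_train_batch_size * gradient_accumulation_steps == effective_bs
gradient_accumulation_steps = effective_bs // (my_gpus * per_device_train_batch_size)
print(gradient_accumulation_steps)   # -> 8
```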
For evaluation, we use Video-MME as an example. First, use the following command to run our VideoITG model and obtain the instructed grounding results.

```bash
bash scripts/eval_lmms_eval/videomme_grounding.sh $REPO_ID_OR_LOCAL_PATH $MODEL_NAME $CONV_MODE
```
In our paper, we report the results of CG-Bench mini, which includes 3,000 QA pairs.
If you find this project useful, please cite our work:
```bibtex
@misc{wang2025videoitgmultimodalvideounderstanding,
      title={VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
      author={Shihao Wang and Guo Chen and De-an Huang and Zhiqi Li and Minghan Li and Guilin Li and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
      year={2025},
      eprint={2507.13353},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.13353},
}
```
- LLaVA: the codebase we built upon.
- Eagle: another codebase we built upon. Thanks for the great pioneering open-source project!
- LMMs-Eval: many thanks to the LMMs-Lab for their wonderful and easy-to-use evaluation tools!
- LLaVA-Video-178K: we train our model with the data from LLaVA-Video-178k.