The official implementation of Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models (CVPR 2025)
git clone https://github.com/lntzm/HICom.git
cd HICom
conda create -n hicom python==3.10
conda activate hicom
conda install pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install numpy==1.26.4
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
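After installation, a quick sanity check (a minimal sketch, assuming the environment created above) confirms that the CUDA build of PyTorch and flash-attn are importable:
# Verify PyTorch sees a GPU and flash-attn was built correctly
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print(flash_attn.__version__)"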
We put all our training and evaluation data and models under the playground folder. The structure is as follows (a setup sketch follows the tree):
playground
├── data
│   ├── eval_image -> /.../LLaVA/playground/data/eval  # Link LLaVA eval folder here
│   ├── eval_video
│   │   ├── Activitynet_Zero_Shot_QA
│   │   ├── EgoSchema
│   │   ├── MLVU
│   │   ├── MSRVTT_Zero_Shot_QA
│   │   ├── MSVD_Zero_Shot_QA
│   │   ├── MVBench
│   │   ├── Video-ChatGPT-eval
│   │   └── Video-MME
│   ├── Ins-VL
│   ├── LLaVA-Instruct-150K
│   ├── LLaVA-Pretrain
│   └── Video_Mix_Instruct
│       ├── Charades
│       ├── CLEVER
│       ├── LLaVA-Hound
│       ├── LLaVA-Video-178K
│       ├── m4_instruct_videos
│       ├── mit_action
│       ├── NTU-RGB-D
│       ├── ssv2-cls
│       ├── TVQA
│       └── Video-ChatGPT-0525
└── models
    ├── Qwen2.5-0.5B-Instruct
    ├── Qwen2.5-1.5B-Instruct
    ├── Qwen2.5-7B-Instruct
    └── siglip-so400m-patch14-384
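The commands below are a minimal sketch of one way to prepare this layout. The LLaVA eval path is a placeholder; only the two base-model repo IDs (Qwen/Qwen2.5-7B-Instruct and google/siglip-so400m-patch14-384) are real Hugging Face identifiers. Point the data folders at wherever your datasets actually live.
# Create the expected skeleton
mkdir -p playground/data/eval_video playground/data/Video_Mix_Instruct playground/models
# Link an existing LLaVA eval folder (placeholder path)
ln -s /path/to/LLaVA/playground/data/eval playground/data/eval_image
# Download the base models into playground/models (requires the huggingface_hub CLI)
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir playground/models/Qwen2.5-7B-Instruct
huggingface-cli download google/siglip-so400m-patch14-384 --local-dir playground/models/siglip-so400m-patch14-384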
Training scripts are under the scripts/qwen2.5_7B folder.
bash scripts/qwen2.5_7B/release/directg_local43_global32.sh
We release our trained checkpoint on Hugging Face. It performs slightly better than the reported results, as we re-organized the code, fixed some bugs, upgraded the environment, and re-trained the model with the text encoder unfrozen.
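As a sketch, the checkpoint can be fetched with the Hugging Face CLI; the repo ID below is a placeholder, so substitute the actual one from our Hugging Face page.
# Placeholder repo ID -- replace with the released HICom checkpoint
CKPT_REPO=your-org/HICom-qwen2.5-7B
huggingface-cli download $CKPT_REPO --local-dir playground/models/HICom-qwen2.5-7B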
Video evaluation scripts are under the scripts/eval/video folder.
# videomme
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/video/eval_video_mcqa_videomme.sh CKPT_PATH
# mvbench
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/video/eval_video_mcqa_mvbench.sh CKPT_PATH
# egoschema
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/video/eval_video_mcqa_egoschema.sh CKPT_PATH
# mlvu
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/video/eval_video_mcqa_mlvu.sh CKPT_PATH
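To run all four benchmarks in one pass, a simple loop over the same scripts works; this is a sketch assuming CKPT_PATH points at your trained checkpoint directory.
CKPT_PATH=playground/models/HICom-qwen2.5-7B  # placeholder; set this to your checkpoint
for task in videomme mvbench egoschema mlvu; do
    CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/eval/video/eval_video_mcqa_${task}.sh $CKPT_PATH
done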
If you find our work useful for your research and applications, please cite using this BibTeX:
@article{liu2025hybrid,
  title={Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models},
  author={Liu, Zhihang and Xie, Chen-Wei and Li, Pandeng and Zhao, Liming and Tang, Longxiang and Zheng, Yun and Liu, Chuanbin and Xie, Hongtao},
  journal={arXiv preprint arXiv:2503.16036},
  year={2025}
}
The codebase of HICom is adapted from VideoLLaMA 2 and LLaVA-OneVision. We are also grateful to the following projects that HICom builds upon: Qwen2.5, SigLIP, and Panda-70M.
This project is released under the Apache 2.0 license as found in the LICENSE file. The service is a research preview intended for non-commercial use ONLY, subject to the model Licenses of LLaMA and Mistral, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please get in touch with us if you find any potential violations.