Following LLaVA v1.5, we add grounding and visual question-answering (VQA) data to the training set, which strengthens the model's reasoning capabilities.
Download the training annotations from https://huggingface.co/datasets/Chat-UniVi/Chat-UniVi-Instruct/tree/main/v1.5_train_json .
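A minimal download sketch using the Hugging Face CLI (`pip install -U "huggingface_hub[cli]"`); the `--local-dir` target is a hypothetical choice, so place the JSON files wherever your dataset config expects them:

```bash
# Fetch only the v1.5 training JSONs from the dataset repo.
huggingface-cli download Chat-UniVi/Chat-UniVi-Instruct \
    --repo-type dataset \
    --include "v1.5_train_json/*" \
    --local-dir ./Chat-UniVi-Instruct
```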
| Datasets | Baidu Disk |
| :--- | :---: |
| Image pretraining (From LLaVA v1.5) | Link |
| Image tuning (From LLaVA v1.5) | Link |
| Video pretraining (From Valley) | Link |
Stage1: Multimodal Pre-training
deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=29602 \
ChatUniVi/train/train_mem.py \
--deepspeed scripts/zero3.json \
--model_name_or_path ${LLM model path} \
--version v1 \
--model_use PRETUNE \
--dataset_use Pretrainv1.5 \
--vision_tower openai/clip-vit-large-patch14-336 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${stage1 save path} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 24000 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
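The `${...}` placeholders above are not valid shell variable names and must be replaced before launching. In this stage, `--tune_mm_mlp_adapter True` trains only the multimodal projector while the base LLM stays frozen (which is why the relatively high learning rate of 2e-3 is used), and with 8 GPUs, a per-device batch size of 16, and no gradient accumulation the effective global batch size is 128. A minimal wrapper sketch with hypothetical paths:

```bash
#!/bin/bash
# Hypothetical paths -- substitute your own checkpoint and output locations.
LLM_MODEL_PATH=./checkpoints/vicuna-7b-v1.5        # base LLM weights
STAGE1_SAVE_PATH=./checkpoints/chat-univi-7b-v1.5-stage1

mkdir -p "${STAGE1_SAVE_PATH}"

# Then pass --model_name_or_path "${LLM_MODEL_PATH}" and
# --output_dir "${STAGE1_SAVE_PATH}" to the deepspeed command above.
```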
Stage2: Joint Instruction Tuning
deepspeed \
--include localhost:0,1,2,3,4,5,6,7 \
--master_port=29601 \
ChatUniVi/train/train_mem.py \
--deepspeed scripts/zero2.json \
--model_name_or_path ${LLM model path} \
--version v1 \
--model_use FINETUNE \
--dataset_use FINETUNEv1.5 \
--vision_tower openai/clip-vit-large-patch14-336 \
--pretrain_mm_mlp_adapter ${stage1 save path}/mm_projector.bin \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ${stage2 save path} \
--num_train_epochs 2 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
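Stage 2 initializes the projector from the stage-1 output via `--pretrain_mm_mlp_adapter` and fine-tunes the full model at the much lower learning rate of 2e-5. A pre-flight sketch (paths hypothetical, matching the stage-1 wrapper above) that fails fast if the projector checkpoint is missing:

```bash
#!/bin/bash
# Hypothetical paths -- must match the stage-1 run above.
STAGE1_SAVE_PATH=./checkpoints/chat-univi-7b-v1.5-stage1
STAGE2_SAVE_PATH=./checkpoints/chat-univi-7b-v1.5

# Stage 2 consumes the projector weights written by stage 1.
if [ ! -f "${STAGE1_SAVE_PATH}/mm_projector.bin" ]; then
    echo "mm_projector.bin not found under ${STAGE1_SAVE_PATH}; run stage 1 first" >&2
    exit 1
fi

mkdir -p "${STAGE2_SAVE_PATH}"
# Pass --pretrain_mm_mlp_adapter "${STAGE1_SAVE_PATH}/mm_projector.bin" and
# --output_dir "${STAGE2_SAVE_PATH}" to the deepspeed command above.
```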
Image Understanding Benchmarks
| Methods | LLM | Visual Tokens | VQA v2 | GQA | VisWiz | SQA I | VQA T | POPE | MMB | LLaVA W | MM-Vet |
| :--- | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaVA v1.5 | Vicuna-7B | 576 | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 64.3 | 63.4 | 30.5 |
| Video-LLaVA | Vicuna-7B | 256 | 74.7 | 60.3 | 48.1 | 66.4 | 51.8 | 84.4 | 60.9 | 73.1 | 32.0 |
| Chat-UniVi-7B v1.5 | Vicuna-7B | 112 | 75.4 | 59.6 | 44.2 | 68.1 | 53.0 | 85.4 | 62.7 | 64.3 | 28.3 |
Video Understanding Benchmarks
| Methods | LLM Size | MSRVTT-QA Acc. | MSRVTT-QA Score | MSVD-QA Acc. | MSVD-QA Score | TGIF-QA Acc. | TGIF-QA Score | ActivityNet-QA Acc. | ActivityNet-QA Score |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Video-LLaMA | 7B | 29.6 | 1.8 | 51.6 | 2.5 | - | - | 12.4 | 1.1 |
| LLaMA-Adapter | 7B | 43.8 | 2.7 | 54.9 | 3.1 | - | - | 34.2 | 2.7 |
| VideoChat | 7B | 45.0 | 2.5 | 56.3 | 2.8 | 34.4 | 2.3 | 26.5 | 2.2 |
| Video-ChatGPT | 7B | 49.3 | 2.8 | 64.9 | 3.3 | 51.4 | 3.0 | 35.2 | 2.7 |
| Video-LLaVA | 7B | 59.2 | 3.5 | 70.7 | 3.9 | 70.0 | 4.0 | 45.3 | 3.3 |
| Chat-UniVi-7B | 7B | 54.6 | 3.1 | 65.0 | 3.6 | 60.3 | 3.4 | 45.8 | 3.2 |
| Chat-UniVi-7B (new video loading code) | 7B | 55.0 | 3.1 | 69.3 | 3.7 | 69.0 | 3.8 | 46.1 | 3.3 |
| Chat-UniVi-7B v1.5 | 7B | 57.5 | 3.2 | 68.8 | 3.7 | 70.0 | 3.8 | 47.2 | 3.3 |
Hallucination Evaluation (POPE)
| Methods | LLM Size | Random Acc. | Random F1 | Random Yes (%) | Popular Acc. | Popular F1 | Popular Yes (%) | Adversarial Acc. | Adversarial F1 | Adversarial Yes (%) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaVA | 7B | 72.16 | 78.22 | 76.29 | 61.37 | 71.52 | 85.63 | 58.67 | 70.12 | 88.33 |
| Video-LLaVA | 7B | 86.2 | 85.2 | 42.0 | 85.3 | 84.0 | 42.1 | 81.6 | 80.8 | 45.8 |
| Chat-UniVi-7B | 7B | 85.19 | 86.05 | 54.67 | 69.50 | 74.39 | 69.10 | 64.97 | 71.54 | 73.10 |
| Chat-UniVi-7B v1.5 | 7B | 87.01 | 86.09 | 41.86 | 85.87 | 84.76 | 42.73 | 83.23 | 82.31 | 44.77 |

"Yes" denotes the proportion of questions the model answers with "yes".