Description
Firstly, the paper is pretty cool. I tried to test the OpenGVLab/VideoChat-R1_7B model on a custom dataset of roughly 5k 30-second videos and found that inference is about 6x slower than with the base Qwen/Qwen2.5-VL-7B-Instruct model, even though, as far as I can tell, both share the same base architecture. I'm using bitsandbytes 4-bit double quantization for both (a loading sketch is included after the settings below). Over the full dataset, OpenGVLab/VideoChat-R1_7B shows an inference ETA of around 48 hrs, whereas Qwen/Qwen2.5-VL-7B-Instruct needs only around 8-9 hrs. These are the settings I tried:
```python
{
    "type": "video",
    "video": media_sample["path"],
    "resized_height": resized_height,  # 560
    "resized_width": resized_width,    # 560
    "fps": fps,                        # 3
}
```
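For reference, both checkpoints are loaded the same way with bitsandbytes 4-bit double quantization. A minimal loading sketch (the NF4 quant type, the bfloat16 compute dtype, and loading VideoChat-R1_7B with the Qwen2.5-VL model class are assumptions here, based on the shared architecture):

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)

# bitsandbytes 4-bit double quantization, applied identically to both models
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",              # assumption: NF4 quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute dtype
)

model_id = "OpenGVLab/VideoChat-R1_7B"  # or "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```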
With resized_height/resized_width = 560 and fps = 3, a 30-second clip comes out to about 90 frames for inference. I even tried the original settings you trained with:
```python
{
    "type": "video",
    "video": video_path,
    "max_pixels": 460800,
    "nframes": 32,
}
```
But even with the above settings the ETA is around 25 hrs. Kindly let me know if I've missed anything.
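For completeness, the per-sample inference follows the standard Qwen2.5-VL pipeline. A minimal sketch, assuming a placeholder prompt and generation length, with `model` and `processor` taken from the loading sketch above:

```python
from qwen_vl_utils import process_vision_info

# `model` and `processor` are loaded as in the sketch above
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 460800,
                "nframes": 32,
            },
            {"type": "text", "text": "Describe the video."},  # placeholder prompt
        ],
    }
]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=256)  # placeholder length
answer = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
```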