Description
Firstly, the paper is pretty cool. I tried to test the OpenGVLab/VideoChat-R1_7B model on a custom dataset of roughly 5k 30-second videos and found that inference is about 6x slower than with the base Qwen/Qwen2.5-VL-7B-Instruct model, even though, as far as I can tell, both share the same base architecture. I'm using bitsandbytes 4-bit double quantization for both (a loading sketch is included after the settings below). Over the full dataset, OpenGVLab/VideoChat-R1_7B shows an inference ETA of around 48 hrs, whereas Qwen/Qwen2.5-VL-7B-Instruct needs only around 8-9 hrs. These are the settings I tried:
```python
{
    "type": "video",
    "video": media_sample["path"],
    "resized_height": resized_height,  # 560
    "resized_width": resized_width,    # 560
    "fps": fps,                        # 3
}
```
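For reference, both checkpoints are loaded the same way with bitsandbytes 4-bit double quantization. A minimal loading sketch (the NF4 quant type, the bfloat16 compute dtype, and loading VideoChat-R1_7B with the Qwen2.5-VL model class are assumptions here, based on the shared architecture):

```python
import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)

# bitsandbytes 4-bit double quantization, applied identically to both models
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",              # assumption: NF4 quant type
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: bf16 compute dtype
)

model_id = "OpenGVLab/VideoChat-R1_7B"  # or "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```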
With resized_height/resized_width = 560 and fps = 3, a 30-second clip comes out to about 90 frames for inference. I even tried the original settings you trained with:
```python
{
    "type": "video",
    "video": video_path,
    "max_pixels": 460800,
    "nframes": 32,
}
```
But even with the above settings the ETA is around 25 hrs. Kindly let me know if I've missed anything.
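For completeness, the per-sample inference follows the standard Qwen2.5-VL pipeline. A minimal sketch, assuming a placeholder prompt and generation length, with `model` and `processor` taken from the loading sketch above:

```python
from qwen_vl_utils import process_vision_info

# `model` and `processor` are loaded as in the sketch above
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 460800,
                "nframes": 32,
            },
            {"type": "text", "text": "Describe the video."},  # placeholder prompt
        ],
    }
]

# Build the chat prompt and extract the video frames
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=256)  # placeholder length
answer = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
```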