
Inference is surprisingly slower than the normal Qwen/Qwen2.5-VL-7B-Instruct model #18

Open
@cs-mshah

Description


Firstly, the paper is pretty cool. I tried to test the OpenGVLab/VideoChat-R1_7B model on a custom dataset of roughly 5k 30-second videos and found that inference is around 6x slower than with the plain Qwen/Qwen2.5-VL-7B-Instruct model, even though I assume both share the same base architecture. I'm using bitsandbytes 4-bit double quantization for both (loading path sketched at the end of this issue). The ETA for inference with OpenGVLab/VideoChat-R1_7B is around 48 hrs, whereas Qwen/Qwen2.5-VL-7B-Instruct needs only around 8-9 hrs. These are the settings I tried:

{
    "type": "video",
    "video": media_sample["path"],
    "resized_height": resized_height, # 560
    "resized_width": resized_width, # 560
    "fps": fps, # 3
}

This comes to about 90 frames per video (30 s at 3 fps). I also tried the original settings you trained with:

{
    "type": "video",
    "video": video_path,
    "max_pixels": 460800,
    "nframes": 32
}

But even the above settings have an ETA of around 25 hrs. Kindly let me know if I've missed anything.
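
For reference, here is a minimal sketch of my inference path, assuming the standard transformers BitsAndBytesConfig and qwen-vl-utils APIs; the prompt text and video path are placeholders, and my actual script differs in minor details. Both models go through exactly this path; only MODEL_ID changes.

import torch
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)
from qwen_vl_utils import process_vision_info

MODEL_ID = "OpenGVLab/VideoChat-R1_7B"  # or "Qwen/Qwen2.5-VL-7B-Instruct"
VIDEO_PATH = "path/to/video.mp4"        # placeholder

# 4-bit double quantization, applied identically to both models.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Video settings plug in here (either of the two configs above).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": VIDEO_PATH,
                "max_pixels": 460800,
                "nframes": 32,
            },
            {"type": "text", "text": "Describe the video."},  # placeholder prompt
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Trim the prompt tokens before decoding, per the standard Qwen2.5-VL example.
generated = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])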
