
Commit 2a467bc

Merge branch 'develop'
2 parents 7f00189 + 19b5525 commit 2a467bc

File tree

6 files changed: +32 -17 lines changed

README.md

Lines changed: 7 additions & 5 deletions
@@ -30,7 +30,7 @@ In this tool, a **describer** is a backend for a family of vision language models
 
 ![Workflow](doc/bvqa-workflow.jpg)
 
-[Example of a report on test data with various vision language models](https://github.com/kingsdigitallab/kdl-vqa/blob/main/doc/bvqa-tests-2025-03-07.pdf)
+[Example of a report on test data with various vision language models](https://github.com/kingsdigitallab/kdl-vqa/blob/main/doc/bvqa-tests-2025-03-11.pdf)
 
 ## Requirements

@@ -107,7 +107,8 @@ A describer is a backend for bvqa that provides support for a family of vision language models
 | qwen-vl | [Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) | 2b:int4 | 7 | 4:53 | unlimited |
 | qwen-vl | [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -o | 3b:BF16 | 21 | 2:49 | |
 | qwen-vl | [allenai/olmOCR-7B-0225-preview](https://huggingface.co/allenai/olmOCR-7B-0225-preview) -o | 7b:BF16 | 24 | 3:21 | |
-| ovis | [AIDC-AI/Ovis2-1B](https://huggingface.co/AIDC-AI/Ovis2-1B) | 1b:BF16 | 3 | 0:42 | |
+| ovis | [AIDC-AI/Ovis2-1B](https://huggingface.co/AIDC-AI/Ovis2-1B) -o | 1b:BF16 | 3 | 0:42 | |
+| ovis | [AIDC-AI/Ovis2-4B](https://huggingface.co/AIDC-AI/Ovis2-4B) -o | 4b:BF16 | 10 | 1:01 | |
 | ollama | [llama3.2-vision](https://ollama.com/library/llama3.2-vision) | 11b:Q4_K_M | 12 | 0:59 | |
 | ollama | [minicpm-v](https://ollama.com/library/minicpm-v) | 8b:Q4_0 | 7 | 1:28 | |
 | ollama | [granite3.2-vision](https://ollama.com/library/granite3.2-vision) | 2b:Q4_K_M | 13 | UNRESPONSIVE | |
@@ -140,7 +141,7 @@ For those describers, the models refer to model names on the Hugging Face hub. I
 
 **Qwen** models can crash as they eat up an extraordinary amount of VRAM. To keep this under control, use the `-o` flag with your `describe` action. It will use flash_attention to drastically reduce memory use. However, the flash attention libraries require more recent generations of GPUs. The use of the `-o` flag is documented in the model column of the table above.
 
-**ovis** despite being small, fast and using very little VRAM, this model requires more recent GPUs due to the reliance on flash_attn package which we found often difficult to install or run on various machines.
+**ovis** also greatly benefits from `-o` (flash attention), reducing VRAM use by 3x.
 
 ## Reviewing (`report`)
 
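Presumably, `-o` toggles the attention implementation passed to `from_pretrained`. A minimal sketch of that mechanism for a generic transformers model (the `load_model` helper below is illustrative, not bvqa's actual describer code):

    import torch
    from transformers import AutoModelForCausalLM

    def load_model(model_id, use_flash_attention=False):
        options = dict(torch_dtype=torch.bfloat16)
        if use_flash_attention:
            # Requires the flash-attn package and a recent GPU generation.
            options['attn_implementation'] = 'flash_attention_2'
        return AutoModelForCausalLM.from_pretrained(model_id, **options)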
@@ -205,7 +206,8 @@ You can combine this with the -f option to test on a few images only.
 
 The -r option tells the tool to ignore the cache.
 When supplied, it will always ask the questions again.
-This is useful in the case where you want to compare the performance between different computing devices (e.g. Nvidia A100 vs L40s GPUs) to estimate the total duration on your entire collection.
+This is useful in the case where you want to compare the performance between different computing devices
+(e.g. Nvidia A100 vs L40s GPUs) to estimate the total duration on your entire collection.
 
 ## Parallelism
 
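For instance, a quick timing run over a small sample might look like this (assuming `-f` takes an image count, which this diff does not show):

    python3 bvqa.py describe -r -f 10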
@@ -262,7 +264,7 @@ After running your questions on a larger proportion of your collection, you might
 As prompt engineering is usually very model-specific, moving to another model can be very disruptive.
 It always means reassessing the answers and often means reformulating many questions from scratch.
 
-## Design principles
+## Guiding principles
 
 * Reproducibility
 * Ease of use

bvqa.py

Lines changed: 2 additions & 0 deletions
@@ -141,6 +141,8 @@ def action_describe(self):
         '''Submit questions about multiple images to a visual model & save answers.'''
         self.new_describer()
 
+        print(f'Describe with describer = {self.describer_name} ; model = {self.describer.get_name()}.')
+
         self.timer.step(f'model: {self.describer.get_name()}')
         import socket
         self.timer.step(f'host : {socket.gethostname()}')

describer/ovis.py

Lines changed: 16 additions & 8 deletions
@@ -10,10 +10,14 @@
 
 
 class Ovis(ImageDescriber):
-    """Image description using SmolVLM model.
+    """Image description using Ovis model.
 
-    https://huggingface.co/AIDC-AI/Ovis2-1B
     1.27B params, BF16
+
+    https://huggingface.co/AIDC-AI/Ovis2-1B
+    https://github.com/AIDC-AI/Ovis
+
+    Works on CPU but extremely slow, even 1B model.
     """
 
     def __init__(self, model_id='', model_version=''):
@@ -88,21 +92,25 @@ def _init_model(self):
             print('WARNING: running model on CPU')
         self._new_model()
 
-        from transformers import AutoProcessor
-        self.processor = AutoProcessor.from_pretrained(self.model_id)
-
         return self.model
 
     def _new_model(self, use_cuda=False, use_attention=False):
         from transformers import AutoModelForCausalLM
         import torch
 
-        self.model = AutoModelForCausalLM.from_pretrained(
-            self.model_id,
+        options = dict(
             torch_dtype=torch.bfloat16,
             multimodal_max_length=32768,
-            trust_remote_code=True
+            trust_remote_code=True,
         )
+
+        # https://github.com/AIDC-AI/Ovis/issues/64#issuecomment-2686944605
+        if not use_cuda:
+            options['device_map'] = 'cpu'
+        if not use_attention:
+            options['llm_attn_implementation'] = 'eager'
+
+        self.model = AutoModelForCausalLM.from_pretrained(self.model_id, **options)
         if use_cuda:
             self.model = self.model.cuda()
 
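Read together, the new `options` dict makes the loader degrade gracefully when CUDA or flash_attn is missing. A self-contained sketch of the same logic, runnable outside the Ovis class (the function name and defaults are illustrative):

    import torch
    from transformers import AutoModelForCausalLM

    def load_ovis(model_id='AIDC-AI/Ovis2-1B', use_cuda=False, use_attention=False):
        options = dict(
            torch_dtype=torch.bfloat16,
            multimodal_max_length=32768,  # Ovis-specific kwarg, honoured via trust_remote_code
            trust_remote_code=True,
        )
        # Workaround from the linked Ovis issue: pin to CPU when CUDA is absent,
        # and fall back to eager attention when flash_attn is unavailable (no -o).
        if not use_cuda:
            options['device_map'] = 'cpu'
        if not use_attention:
            options['llm_attn_implementation'] = 'eager'
        model = AutoModelForCausalLM.from_pretrained(model_id, **options)
        return model.cuda() if use_cuda else model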
test/data/test_cases.json

Lines changed: 2 additions & 2 deletions
@@ -13,8 +13,8 @@
         "location": ["indoor"]
     },
     "form": {
-        "long_description": ["Streetcar Magazine"],
+        "long_description": ["streetcar museum"],
         "location": ["N/A"],
-        "text": ["Tour", "Recital", "50% off with SBO", "renew old memories"]
+        "text": ["tour", "recital", "50% off with sbo", "renew old memories"]
     }
 }

test/describe/all.bash

Lines changed: 5 additions & 1 deletion
@@ -2,5 +2,9 @@ cd "$(dirname "$0")"
 bash describer.bash moondream vikhyatk/moondream2
 CUDA_VISIBLE_DEVICES=0 bash describer.bash qwen-vl Qwen/Qwen2.5-VL-3B-Instruct
 bash describer.bash smol HuggingFaceTB/SmolVLM-Instruct
-bash describer.bash ollama llama3.2-vision
+bash describer.bash ovis AIDC-AI/Ovis2-1B
+# TODO: understand why it hangs on 'object' qst for susan-q image
+# bash describer.bash ollama llama3.2-vision
+bash describer.bash ollama minicpm-v
+cd ../..
 python3 bvqa.py report -R test/data -t

utils/helpers.py

Lines changed: 0 additions & 1 deletion
@@ -103,7 +103,6 @@ def get_repeat_ratio(answer):
     while True:
         if len(words) < 2*l: break
         if ' '.join(words[-l:]) == ' '.join(words[-2*l:-l]):
-            print(words[-l:])
             ret = 1.0
             break
         l += 1
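The surviving loop flags answers whose tail repeats verbatim, a common failure mode of a model stuck in a loop. A self-contained reading of the function (the initialisation of `words`, `l` and `ret` is inferred from context, not shown in this hunk):

    def get_repeat_ratio(answer):
        words = answer.split()  # inferred: compare word sequences, not characters
        ret = 0.0
        l = 1
        while True:
            if len(words) < 2*l: break
            # If the last l words exactly repeat the l words before them,
            # treat the answer as degenerate (the model got stuck in a loop).
            if ' '.join(words[-l:]) == ' '.join(words[-2*l:-l]):
                ret = 1.0
                break
            l += 1
        return ret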
