Conversation
pcuenca
left a comment
Has huggingface/transformers#34170 been released? If it hasn't, should we wait until the next transformers release before merging or is it ok?
```python
image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
```
The example is super nice, but not all image-text-to-text models support multiple images reliably. I'd go for a simpler single-image example for now.
Agreed with this! Another problem is that although most models use the […]. However, since you have this check:

```js
if (model.tags.includes("conversational") && model.config?.tokenizer_config?.chat_template) {
```

it would be even better and simpler to use the chat template for VLMs as well, with something like this:

```python
image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s the difference between these two images?"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]

outputs = pipe([image_ny, image_chicago], text=messages)
```
I completely agree with @yonigozlan, best to have a chat template.
I think this was finally handled by @Vaibhavs10 in #1434 (sorry, I did not remember about this PR when the other one was opened) |
Description
Right now, if you go to a conversational VLM like Llama-3.2-11B-Vision-Instruct, you do not get a high-level pipeline snippet.
By contrast, if you go to a conversational LLM like Llama-3.1-8B-Instruct, you do get one.
huggingface/transformers#34170 has been merged, so we can now add a pipeline snippet for VLMs.
Here is an example snippet one would get:
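For reference, a minimal sketch of what such a high-level pipeline snippet might look like. The model id, image URL, and prompt below are illustrative assumptions, not the exact output of the snippet generator; the actual generated snippet may differ.

```python
# Sketch of a high-level pipeline snippet for a conversational VLM.
# The model id, image URL, and prompt are illustrative assumptions.

# Chat-template style messages: the image and the text prompt share one user turn.
image_url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

def run(model_id: str = "meta-llama/Llama-3.2-11B-Vision-Instruct") -> str:
    """Build the image-text-to-text pipeline and generate a reply.

    Requires `transformers` (with huggingface/transformers#34170 included)
    and access to the model weights.
    """
    from transformers import pipeline

    pipe = pipeline("image-text-to-text", model=model_id)
    outputs = pipe(text=messages, max_new_tokens=40)
    return outputs[0]["generated_text"]
```

Note that the messages follow the same chat-template structure suggested above, so the snippet stays consistent between LLMs and VLMs.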