[transformers snippet] Support pipeline VLMs #1012

mishig25 · 2024-11-04T09:46:18Z

Description

Right now, if you open a conversational VLM such as Llama-3.2-11B-Vision-Instruct, you do not get a high-level pipeline snippet.
In contrast, if you open a conversational LLM such as Llama-3.1-8B-Instruct, you do get one.

huggingface/transformers#34170 has been merged, so we can now add a pipeline snippet for VLMs.

Here is an example of the snippet one would get:

# Use a pipeline as a high-level helper
from transformers import pipeline

image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"

pipe = pipeline("image-text-to-text", model="meta-llama/Llama-3.2-11B-Vision-Instruct")
pipe(
    images=[image_ny, image_chicago],
    text="<image> <image> Are these the same cities? If not what cities are these?",
)

pcuenca

Has huggingface/transformers#34170 been released? If it hasn't, should we wait until the next transformers release before merging or is it ok?

pcuenca · 2024-11-04T12:33:46Z

packages/tasks/src/model-libraries-snippets.ts

+					`image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"`,
+					`image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"`,


The example is super nice, but not all image-text-to-text models support multiple images reliably. I'd go for a simpler single-image example for now.

yonigozlan · 2024-11-04T12:57:16Z

The example is super nice, but not all image-text-to-text models support multiple images reliably. I'd go for a simpler single-image example for now.

Agreed with this! Another problem is that, although most models use the <image> token, some use a different one. For example, Pixtral uses [IMG] and mllama uses <|image|>, so this snippet would not work with those models.
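To make the token mismatch concrete, here is a minimal sketch of what a raw-prompt snippet would have to handle. The `IMAGE_TOKENS` table and the `build_prompt` helper are hypothetical names invented for this illustration and are not part of transformers; only the three token strings come from the comment above.

```python
# Hypothetical illustration: image placeholder tokens differ per model family.
# Only the token strings come from the discussion; the keys and helper are made up.
IMAGE_TOKENS = {
    "pixtral": "[IMG]",
    "mllama": "<|image|>",
}
DEFAULT_IMAGE_TOKEN = "<image>"  # used by most models, per the comment above


def build_prompt(model_family: str, num_images: int, question: str) -> str:
    """Prefix `question` with one image placeholder per input image."""
    token = IMAGE_TOKENS.get(model_family, DEFAULT_IMAGE_TOKEN)
    placeholders = " ".join([token] * num_images)
    return f"{placeholders} {question}".strip()


print(build_prompt("mllama", 2, "Are these the same cities?"))
# prints: <|image|> <|image|> Are these the same cities?
```

A generated snippet that hardcodes `<image>` is effectively this function with the lookup table removed, which is why it breaks for Pixtral and mllama.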

However, since you have this check:

if (model.tags.includes("conversational") && model.config?.tokenizer_config?.chat_template) {

it would be even better and simpler to use the chat template for VLMs as well, with something like this:

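from transformers import pipeline

# Model taken from the PR description above, used here only for illustration
pipe = pipeline("image-text-to-text", model="meta-llama/Llama-3.2-11B-Vision-Instruct")

image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
# One {"type": "image"} entry per input image; no model-specific image token is written by hand
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s the difference between these two images?"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
outputs = pipe([image_ny, image_chicago], text=messages)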
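The point of the chat-template approach is that the snippet only emits generic `{"type": "image"}` entries and leaves the model-specific token to the template. A minimal sketch of building such a message list for any number of images follows; `build_messages` is a hypothetical helper name, not a transformers API, and the message layout simply mirrors the example above.

```python
def build_messages(question: str, num_images: int) -> list:
    """Build a single-turn chat message with one image entry per input image."""
    content = [{"type": "text", "text": question}]
    # Generic placeholders only; no "<image>", "[IMG]" or "<|image|>" strings here.
    content += [{"type": "image"} for _ in range(num_images)]
    return [{"role": "user", "content": content}]


messages = build_messages("What is the difference between these two images?", 2)
print(len(messages[0]["content"]))  # prints: 3
```

For the single-image example suggested earlier in the review, the same helper would be called with `num_images=1`.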

merveenoyan · 2024-11-07T18:11:21Z

I completely agree with @yonigozlan: it's best to use the chat template.

[transformers snippet] Support VLMs

17e747a

mishig25 marked this pull request as ready for review November 4, 2024 09:56

mishig25 requested review from SBrandeis, gary149, Wauplin, julien-c, pcuenca and ngxson as code owners November 4, 2024 09:56

mishig25 requested review from LysandreJik and yonigozlan November 4, 2024 09:56

mishig25 changed the title ~~[transformers snippet] Support VLMs~~ [transformers snippet] Support pipeline VLMs Nov 4, 2024

pcuenca reviewed Nov 4, 2024
