Add Phi3.5 Vision Model #41977
Conversation
cc @zucchini-nlp when you get a chance!

Phi3.5 💀 I will take a look some time this week

@zucchini-nlp The PR should be ready for review next week. I will ping you then.

@zucchini-nlp The PR is ready for initial review 🤗. CI has been broken since the weekend, but it was all green previously, so no issues there.
```python
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
    # Process the image crops with the CLIP vision model.
    vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

    # Extract the hidden states from the second-to-last layer,
    # dropping the leading CLS token ([:, 1:]).
    hidden_state = vision_outputs.hidden_states[-2][:, 1:]
    hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

    # Transform the image features to the text embedding space.
    image_features = self.transform_image_embeds(hidden_state, image_sizes)
    return image_features
```
Mainly because of this function, where the image features are transformed and projected in a somewhat non-standard way. The use of image sizes also makes it difficult to run inference with `num_return_sequences > 1`, because the image inputs are then no longer in sync with the repeated `input_ids`. The relevant tests, such as beam search, are therefore skipped.
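To make the sync issue concrete, here is a minimal, self-contained sketch; the expansion step mirrors what `generate()` does internally, while all concrete shapes and numbers are made up for illustration:

```python
import torch

# Illustrative shapes only: 1 prompt with 1 image split into 5 crops.
input_ids = torch.randint(0, 100, (1, 8))
pixel_values = torch.randn(1, 5, 3, 336, 336)

# With num_return_sequences=2, generate() repeats input_ids along the
# batch dimension, but the image inputs are not repeated with them,
# so text rows and image inputs no longer line up 1:1.
expanded_ids = input_ids.repeat_interleave(2, dim=0)
print(expanded_ids.shape)   # torch.Size([2, 8])  -> two text rows
print(pixel_values.shape)   # torch.Size([1, 5, 3, 336, 336])  -> one image
```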
```python
for prompt in text:
    prompt_splits = re.split(r"(\<\|image\|\>)", prompt)

    tokenized_outputs = []
    for split in prompt_splits:
        if split == "<|image|>":
            if image_token_counter >= len(num_image_tokens):
                raise ValueError("More image placeholders in the text than images provided.")
            image_tokens = [self.image_token_id] * num_image_tokens[image_token_counter]
            tokenized_outputs.extend(image_tokens)
            image_token_counter += 1
        else:
            text_tokens = self.tokenizer(split)["input_ids"]
            tokenized_outputs.extend(text_tokens)

    tokenized_prompts.append(tokenized_outputs)
```
Because of the custom tokenization and the attention mask that is built alongside it, it's difficult to support assisted decoding, and IMO it's not a particularly important generation method to support in this case.
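As an aside, here is a toy, self-contained run of the same interleaving logic as the loop above; the token IDs and the fake vocabulary are made up for illustration:

```python
import re

image_token_id = 32000          # made-up ID for the image placeholder token
num_image_tokens = [4]          # pretend the single image expands to 4 tokens
fake_vocab = {"Describe ": [10, 11], " briefly.": [12, 13]}

prompt = "Describe <|image|> briefly."
tokenized_output, image_token_counter = [], 0

for split in re.split(r"(<\|image\|>)", prompt):
    if split == "<|image|>":
        tokenized_output.extend([image_token_id] * num_image_tokens[image_token_counter])
        image_token_counter += 1
    elif split:  # skip empty strings the split can produce
        tokenized_output.extend(fake_vocab[split])

print(tokenized_output)  # [10, 11, 32000, 32000, 32000, 32000, 12, 13]
```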
zucchini-nlp left a comment
@yaswanth19 thanks for the PR, looks much cleaner already!
I left some comments, mostly nit-picking for better standardization. Also I believe there's one test failing with Phi3.5V :)
```python
processor_dict = self.prepare_processor_dict()
self.assertTrue(processor_loaded.chat_template == processor_dict.get("chat_template", None))

@unittest.skip("Not possible as processor creates a custom attention mask.")
```
The mask format doesn't look custom, even though it is prepared manually instead of being returned by the tokenizer.
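Put differently (my reading of the comment, with made-up values): a manually built mask over the final token sequence has the same standard format the tokenizer would return anyway:

```python
# All real tokens get 1; only padding would get 0, matching the tokenizer's output.
input_ids = [10, 11, 32000, 32000, 12, 13]
attention_mask = [1] * len(input_ids)
print(attention_mask)  # [1, 1, 1, 1, 1, 1]
```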
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am skipping this test because it requires offset mapping, which is quite difficult to fetch given the way we tokenize the prompt.
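For context, offset mapping is normally available directly from a fast tokenizer; the sketch below (the checkpoint name is just an example) shows the standard path and why segment-wise tokenization breaks it:

```python
from transformers import AutoTokenizer

# return_offsets_mapping requires a fast (Rust-backed) tokenizer.
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
enc = tok("Describe the image briefly.", return_offsets_mapping=True)
print(enc["offset_mapping"])  # character spans relative to the full string

# When the prompt is split on <|image|> and each piece is tokenized
# separately, each piece's offsets restart at 0 and can no longer be
# mapped back to positions in the original prompt.
```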
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, phi3_v
Closes #36036