Conversation

@yaswanth19
Contributor

@yaswanth19 yaswanth19 commented Nov 2, 2025

Closes #36036

@Rocketknight1
Member

cc @zucchini-nlp when you get a chance!

@zucchini-nlp
Member

Phi3.5 💀 I will take a look sometime this week

@yaswanth19
Contributor Author

@zucchini-nlp PR should be ready for review next week. Will ping you at that time.

@yaswanth19
Contributor Author

yaswanth19 commented Nov 10, 2025

@zucchini-nlp PR is ready for initial review 🤗. CI has been broken since the weekend, but it was all green previously, so no issues there.

Comment on lines +246 to +256
def get_image_features(self, pixel_values: torch.Tensor, image_sizes, num_images, num_crops):
# Process image using CLIP model.
vision_outputs = self.vision_model(pixel_values, output_hidden_states=True)

# Extract the hidden states from the second last layer.
hidden_state = vision_outputs.hidden_states[-2][:, 1:]
hidden_state = hidden_state.reshape(num_images, num_crops, -1, self.image_dim_out)

# Transform the image features to text embedding space.
image_features = self.transform_image_embeds(hidden_state, image_sizes)
return image_features
Contributor Author

@yaswanth19 yaswanth19 Nov 10, 2025

Mainly because of this function, where the image features are transformed and projected in a somewhat non-standard way. The use of image sizes also makes it difficult to run inference with num_return_sequences > 1, since the image inputs are then not synced with the repeated input_ids. The affected tests, such as beam search, are therefore skipped.
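
For context, a minimal sketch of how the affected generation tests might be skipped in the model test class; the test names and skip reasons below are illustrative assumptions, not copied from this PR.

import unittest

class Phi3VModelTest(unittest.TestCase):
    # Hypothetical skips: the image features are built from image_sizes, so the image
    # inputs are not synced with input_ids once generate() repeats them.
    @unittest.skip("Image inputs are not synced with repeated input_ids when num_return_sequences > 1.")
    def test_beam_search_generate(self):
        pass

    @unittest.skip("Same reason: the non-standard image feature transform relies on image_sizes.")
    def test_group_beam_search_generate(self):
        pass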

Comment on lines 136 to 151
for prompt in text:
prompt_splits = re.split(r"(\<\|image\|\>)", prompt)

tokenized_outputs = []
for split in prompt_splits:
if split == "<|image|>":
if image_token_counter >= len(num_image_tokens):
raise ValueError("More image placeholders in the text than images provided.")
image_tokens = [self.image_token_id] * num_image_tokens[image_token_counter]
tokenized_outputs.extend(image_tokens)
image_token_counter += 1
else:
text_tokens = self.tokenizer(split)["input_ids"]
tokenized_outputs.extend(text_tokens)

tokenized_prompts.append(tokenized_outputs)
Contributor Author

Because of the custom tokenization and the subsequent manual creation of the attention mask, it's difficult to support assisted decoding, and IMO it's not a particularly important generation method to support in this case.
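
For reference, a rough sketch (assumed, not taken from this PR) of how the attention mask could be built by hand once each prompt has been tokenized piece by piece as above; the pad token value and left padding are assumptions.

import torch

pad_token_id = 0  # assumption; in practice this would come from the tokenizer
max_len = max(len(ids) for ids in tokenized_prompts)

input_ids, attention_mask = [], []
for ids in tokenized_prompts:
    pad_len = max_len - len(ids)
    input_ids.append([pad_token_id] * pad_len + ids)       # left padding (assumed)
    attention_mask.append([0] * pad_len + [1] * len(ids))  # 0 = padding, 1 = real token

input_ids = torch.tensor(input_ids)
attention_mask = torch.tensor(attention_mask)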

Member

@zucchini-nlp zucchini-nlp left a comment

@yaswanth19 thanks for the PR, looks much cleaner already!

I left some comments, mostly nit-picking for better standardization. Also, I believe there's one test failing with Phi3.5V :)

processor_dict = self.prepare_processor_dict()
self.assertTrue(processor_loaded.chat_template == processor_dict.get("chat_template", None))

@unittest.skip("Not possible as processor creates a custom attention mask.")
Member

The mask format doesn't look custom, even though it's prepared manually instead of letting the tokenizer build it.

Contributor Author

@yaswanth19 yaswanth19 Nov 15, 2025

I am skipping this test because it requires offset mapping, which is quite difficult to obtain given the way we tokenize the prompt.
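
For context, the offset mapping the test relies on is normally returned directly by a fast tokenizer; a minimal sketch with an assumed checkpoint:

from transformers import AutoTokenizer

# Assumed checkpoint, purely for illustration; any fast tokenizer can return offsets.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
encoding = tokenizer("a prompt with <|image|> placeholders", return_offsets_mapping=True)
# One (start, end) character span per token; hard to reconstruct here because the
# prompt is split on <|image|> and tokenized piece by piece in the processor.
print(encoding["offset_mapping"])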

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, phi3_v
