
[Refactor] simplify multimodal data processing #8107


Open · wants to merge 30 commits into main

Conversation

@JustinTong0323 (Collaborator) commented Jul 17, 2025

This pull request aims to refactor and simplify the multimodal data processing within the system. The core change involves centralizing model-specific data into a single dictionary within the MultimodalDataItem class, reducing boilerplate and improving extensibility. This refactoring impacts how various multimodal models and their processors handle and access their specific input features.

Highlights

  • Centralized Multimodal Data Storage: The MultimodalDataItem class has been refactored to consolidate model-specific attributes (e.g., image_grid_thw, audio_feature_lens) into a new model_specific_data dictionary, simplifying the data structure and making it more generic.
  • Streamlined Multimodal Processors: Multimodal processors (e.g., CLIP, Deepseek-VL, Mllama) have been updated to utilize the new model_specific_data structure, reducing redundant code and improving consistency in how data is prepared for models.
  • Model Adaptations: Numerous model implementations (e.g., Deepseek-VL2, Gemma3N, Kimi-VL, MiniCPM, Phi4MM, Qwen2.5-VL, Qwen2-Audio, Qwen2-VL) have been adjusted to retrieve their specific input features from the model_specific_data dictionary.
  • Processor Utility Enhancements: The BaseMultimodalProcessor now includes improved logic for collecting and combining multimodal data, leveraging the new centralized data structure.
  • File Renaming/Restructuring: The qwen_audio.py processor file has been moved and renamed, indicating a potential restructuring of the multimodal processor directory.
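To make the highlights concrete, here is a minimal sketch of what the refactored `MultimodalDataItem` might look like. Only `model_specific_data` is taken from the PR description; the `modality` field and the example keys are illustrative, not the actual class definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MultimodalDataItem:
    """Hypothetical sketch: one generic dict replaces per-model attributes."""

    modality: str
    # Formerly separate attributes such as image_grid_thw or
    # audio_feature_lens now live in one generic dictionary.
    model_specific_data: Dict[str, Any] = field(default_factory=dict)


item = MultimodalDataItem(modality="image")
item.model_specific_data["image_grid_thw"] = [(1, 24, 24)]
print(item.model_specific_data["image_grid_thw"])
```

New model-specific fields can now be added without touching the class definition, which is the extensibility win the description claims.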

Checklist

This commit refactors how multimodal data is handled, promoting a more organized and maintainable structure.

- Consolidates model-specific data within a dictionary inside MultimodalDataItem, reducing code duplication and improving data organization.
- Removes the qwen_audio.py processor as part of the refactoring.

Signed-off-by: Xinyuan Tong <[email protected]>

@mickqian (Collaborator) commented:

This has quite some duplication with #7924. Can we prioritize one?

@JustinTong0323 (Collaborator, Author) replied:

> This has quite some duplication with #7924. Can we prioritize one?

Yes, we could prioritize that PR first.

@JustinTong0323 JustinTong0323 changed the title [Refactor] simplify multimodal data processing [WIP][Refactor] simplify multimodal data processing Jul 17, 2025
JustinTong0323 and others added 5 commits July 16, 2025 21:39
Signed-off-by: Xinyuan Tong <[email protected]>
…dule_batch.py and base_processor.py

Signed-off-by: Xinyuan Tong <[email protected]>
Moves model-specific data from individual `MultimodalDataItem` attributes
to a dictionary within `model_specific_data` for better organization
and maintainability. This change simplifies access to model-specific
features during processing.

Signed-off-by: Xinyuan Tong <[email protected]>

Xinyuan Tong added 3 commits July 19, 2025 00:39
Updated multiple image processor classes to use a shared `mm_tokens` attribute for multimodal token management, improving consistency and reducing redundancy in the codebase. This change enhances maintainability and simplifies the processing of multimodal data.

Signed-off-by: Xinyuan Tong <[email protected]>
Updated type hints for the `data_iterators` parameter and `all_collected_items` variable in the BaseMultimodalProcessor class to improve code clarity and type safety. This change aids in better understanding and maintainability of the multimodal processing logic.

Signed-off-by: Xinyuan Tong <[email protected]>


Xinyuan Tong added 2 commits July 19, 2025 01:12
Updated various model files to replace direct dictionary access with the get method for retrieving values from model_specific_data. This change enhances code robustness by preventing potential KeyErrors and improves consistency across the codebase.

Signed-off-by: Xinyuan Tong <[email protected]>
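The commit above swaps direct dictionary indexing for `dict.get`. A tiny self-contained illustration of why (the key name is an example from the PR description, not tied to any one model):

```python
# With many models sharing one dict, a model may look up a key that
# another model's processor never populates.
model_specific_data = {"image_grid_thw": [(1, 24, 24)]}

# Direct indexing raises KeyError for an absent key:
try:
    model_specific_data["audio_feature_lens"]
    missing = False
except KeyError:
    missing = True

# .get returns None (or a chosen default) instead of raising:
print(model_specific_data.get("audio_feature_lens"))   # None
print(model_specific_data.get("image_grid_thw"))       # [(1, 24, 24)]
```

This is what "enhances code robustness by preventing potential KeyErrors" refers to.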
This simplifies the processor interfaces and removes a parameter
that is no longer needed in the processing logic.

Signed-off-by: Xinyuan Tong <[email protected]>
@JustinTong0323 JustinTong0323 changed the title [WIP][Refactor] simplify multimodal data processing [Refactor] simplify multimodal data processing Jul 19, 2025
Xinyuan Tong added 6 commits July 19, 2025 04:48
Moves multimodal data attributes to the `model_specific_data` dictionary within `MultimodalDataItem`.

This change improves code organization and flexibility by encapsulating model-specific data, such as image sizes and attention masks, within a dedicated dictionary. It also prepares the codebase for easier extension to support diverse multimodal models with varying data requirements.

Signed-off-by: Xinyuan Tong <[email protected]>
Refactors the audio processing logic in Qwen2 to improve efficiency and readability.

- Pre-collects and stores special token IDs to avoid redundant lookups.
- Extracts and consolidates multimodal token information.
- Simplifies the data processing flow.

Signed-off-by: Xinyuan Tong <[email protected]>
Updates multimodal processors to correctly utilize and pass the image token ID.

This change ensures that the `image_token_id` is consistently accessed from the `mm_tokens` object instead of directly from the processor, leading to more reliable and maintainable code. It also initializes image_token_id in the MultimodalSpecialTokens for processors that need it.

Signed-off-by: Xinyuan Tong <[email protected]>
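A hedged sketch of the pattern this commit describes: processors read `image_token_id` from a shared `mm_tokens` object instead of each processor carrying its own copy. The `MultimodalSpecialTokens` class and `expand_image_tokens` helper below are illustrative stand-ins, not the actual SGLang definitions, and the token ID is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalSpecialTokens:
    """Hypothetical container for the tokens all processors share."""

    image_token: Optional[str] = None
    image_token_id: Optional[int] = None


mm_tokens = MultimodalSpecialTokens(image_token="<image>", image_token_id=151655)


def expand_image_tokens(
    input_ids: List[int], num_patches: int,
    tokens: MultimodalSpecialTokens = mm_tokens,
) -> List[int]:
    # Consult the shared tokens object, not a per-processor attribute.
    out: List[int] = []
    for tid in input_ids:
        out.extend([tid] * (num_patches if tid == tokens.image_token_id else 1))
    return out


print(expand_image_tokens([1, 151655, 2], 3))
```

Centralizing the ID this way is what makes access "consistent" across processors: there is exactly one place to initialize it.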
This update changes references from `precomputed_features` to `precomputed_embeddings` across various modules, including the VLM input format, multimodal data item, and processor classes. This refactor enhances consistency in the handling of multimodal data and improves code clarity.

Signed-off-by: Xinyuan Tong <[email protected]>
This update removes the `image_sizes` attribute from the `MultimodalDataItem` class and clarifies the purpose of `precomputed_embeddings`. These changes enhance the organization of multimodal data handling and improve code clarity.

Signed-off-by: Xinyuan Tong <[email protected]>
- Added `items` method to `DictOutput` class for better dictionary-like behavior.
- Renamed `images` attribute to `pixel_values` in `VLChatProcessorOutput` for consistency.
- Updated `DeepseekVL2ForCausalLM` to use `images_spatial_crop` instead of `image_spatial_crop`.
- Modified `DeepseekVL2ImageProcessor` to include `max_req_input_len` in the processing method.
- Adjusted multimodal processor calls to pass additional parameters for improved flexibility.

These changes streamline the multimodal data processing and enhance the overall organization of the codebase.

Signed-off-by: Xinyuan Tong <[email protected]>
@JustinTong0323 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces a significant refactoring of the multimodal data processing pipeline. Centralizing model-specific data into the model_specific_data dictionary in MultimodalDataItem simplifies the data structures and improves extensibility.

I identified a high-severity issue: MultimodalDataItem.merge does not yet merge the new model_specific_data dictionary. There are also medium-severity issues around code duplication and a minor simplification opportunity.
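The merge issue the review flags can be sketched as follows. This is a hypothetical fix under the assumption that list-valued entries (per-item tensors or shapes) should be concatenated when two items merge; the function name and merge policy are illustrative, not the actual patch.

```python
from typing import Any, Dict


def merge_model_specific_data(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Combine the model_specific_data dicts of two merging items.

    List-valued fields present in both dicts are concatenated; any other
    collision is resolved by taking the right-hand value.
    """
    merged = dict(a)
    for key, value in b.items():
        if key in merged and isinstance(merged[key], list) and isinstance(value, list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


a = {"image_grid_thw": [(1, 24, 24)]}
b = {"image_grid_thw": [(1, 12, 12)], "audio_feature_lens": [80]}
print(merge_model_specific_data(a, b))
```

Without such handling, a merge would silently keep only one item's model-specific fields, which is exactly the high-severity behavior described above.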

Xinyuan Tong and others added 11 commits July 19, 2025 09:18
Refactors how multimodal models access data within `MultimodalDataItem`.

- Adds `__getitem__`, `get`, `__setitem__`, and `set` methods to `MultimodalDataItem` for more intuitive and flexible data access.
- Updates multimodal models to use these new methods instead of directly accessing `model_specific_data`.

This change improves code readability and maintainability by providing a consistent interface for accessing data associated with multimodal items.

Signed-off-by: Xinyuan Tong <[email protected]>
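The accessor methods this commit adds can be sketched as below. Only the method names (`__getitem__`, `get`, `__setitem__`, `set`) come from the commit message; the rest of the class is a minimal stand-in.

```python
from typing import Any


class MultimodalDataItem:
    """Sketch of the dict-style access protocol added in this commit."""

    def __init__(self) -> None:
        self.model_specific_data: dict = {}

    def __getitem__(self, key: str) -> Any:
        return self.model_specific_data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self.model_specific_data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value


item = MultimodalDataItem()
item["image_grid_thw"] = [(1, 24, 24)]
print(item.get("missing"))          # None instead of KeyError
print(item["image_grid_thw"])
```

Model code can then write `item["key"]` or `item.get("key")` without knowing that the values live in `model_specific_data`, which is the consistent interface the commit message describes.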
Replaces direct attribute access with the new `get` method for `aspect_ratio_id` and `aspect_ratio_mask` in the `MllamaForConditionalGeneration` class. This change enhances code readability and aligns with recent refactoring efforts in the multimodal data handling.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Commented out the TestMllamaServer class and its methods in test_vision_openai_server_a.py, indicating that Mllama is not stable for CI. This change prevents potential failures in the test suite while maintaining the code for future use.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
…attr__` for model-specific data.

This change improves code readability and conciseness in multimodal models like DeepseekVL2, Gemma3n, KimiVL, Llava, MiniCPMO, MiniCPMv, Mistral, Mllama, Phi4MM, Qwen2, and Qwen2Audio by allowing direct access to data attributes instead of using `item.get("key")`. It also fixes a bug in Phi4MMProcessorAdapter where the original hf_key was not being deleted from the result, leading to incorrect assignment.

Signed-off-by: Xinyuan Tong <[email protected]>
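The `__getattr__`-based access this commit describes might look like the sketch below, which lets models write `item.image_grid_thw` instead of `item.get("image_grid_thw")`. This is an illustrative stand-in, not the actual implementation.

```python
from typing import Any


class MultimodalDataItem:
    """Sketch: attribute access falls through to model_specific_data."""

    def __init__(self, **kwargs: Any) -> None:
        # Assign through __dict__ so this line does not itself trigger
        # __getattr__ (which would recurse before the dict exists).
        self.__dict__["model_specific_data"] = dict(kwargs)

    def __getattr__(self, name: str) -> Any:
        # Invoked only when normal attribute lookup fails.
        try:
            return self.model_specific_data[name]
        except KeyError:
            raise AttributeError(name)


item = MultimodalDataItem(image_grid_thw=[(1, 24, 24)])
print(item.image_grid_thw)
```

Because `__getattr__` is only called after normal lookup fails, real attributes such as `model_specific_data` itself are unaffected.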
A collaborator commented on this test snippet:

    ====================== video_response =====================
    {video_response}
    ===========================================================
    should contain 'iPod' or 'device' or 'microphone'

Better not hard-code this.

@JustinTong0323 JustinTong0323 added high priority MLLM multi-modal language model labels Jul 20, 2025