[Refactor] simplify multimodal data processing #8107
base: main
Conversation
This commit refactors how multimodal data is handled, promoting a more organized and maintainable structure.
- Consolidates model-specific data within a dictionary inside `MultimodalDataItem`, reducing code duplication and improving data organization.
- Removes the `qwen_audio.py` processor as part of the refactoring.

Signed-off-by: Xinyuan Tong <[email protected]>
This has quite a bit of duplication with #7924. Can we prioritize one?
Yes, we could prioritize that PR first.
…dule_batch.py and base_processor.py Signed-off-by: Xinyuan Tong <[email protected]>
Moves model-specific data from individual `MultimodalDataItem` attributes to a dictionary within `model_specific_data` for better organization and maintainability. This change simplifies access to model-specific features during processing. Signed-off-by: Xinyuan Tong <[email protected]>
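As a rough sketch of the resulting shape (field names other than `model_specific_data` are assumptions for illustration, not the project's exact schema):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class MultimodalDataItem:
    # Fields shared by all models; names here are illustrative.
    modality: str = "image"
    precomputed_embeddings: Optional[Any] = None
    # Everything a particular model needs (image sizes, attention masks,
    # spatial crops, ...) lives in one dictionary instead of a growing
    # set of per-model attributes.
    model_specific_data: Dict[str, Any] = field(default_factory=dict)


item = MultimodalDataItem()
item.model_specific_data["image_sizes"] = [(336, 336)]
print(item.model_specific_data.get("image_sizes"))
```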
Updated multiple image processor classes to use a shared `mm_tokens` attribute for multimodal token management, improving consistency and reducing redundancy in the codebase. This change enhances maintainability and simplifies the processing of multimodal data. Signed-off-by: Xinyuan Tong <[email protected]>
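A minimal sketch of the shared-attribute pattern this describes (the class and field names below are assumptions based on the commit message):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultimodalSpecialTokens:
    # Which fields exist here is an assumption for illustration.
    image_token: Optional[str] = None
    image_token_id: Optional[int] = None
    audio_token: Optional[str] = None


class ExampleImageProcessor:
    """Hypothetical processor showing the shared `mm_tokens` pattern."""

    def __init__(self) -> None:
        # Every processor exposes its tokens through the same attribute,
        # so callers read `processor.mm_tokens.image_token_id` instead of
        # a different per-class attribute name.
        self.mm_tokens = MultimodalSpecialTokens(
            image_token="<image>", image_token_id=32000
        )
```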
Updated type hints for the `data_iterators` parameter and `all_collected_items` variable in the BaseMultimodalProcessor class to improve code clarity and type safety. This change aids in better understanding and maintainability of the multimodal processing logic. Signed-off-by: Xinyuan Tong <[email protected]>
Updated various model files to replace direct dictionary access with the get method for retrieving values from model_specific_data. This change enhances code robustness by preventing potential KeyErrors and improves consistency across the codebase. Signed-off-by: Xinyuan Tong <[email protected]>
This simplifies the processor interfaces and removes a parameter that is no longer needed in the processing logic. Signed-off-by: Xinyuan Tong <[email protected]>
Moves multimodal data attributes to the `model_specific_data` dictionary within `MultimodalDataItem`. This change improves code organization and flexibility by encapsulating model-specific data, such as image sizes and attention masks, within a dedicated dictionary. It also prepares the codebase for easier extension to support diverse multimodal models with varying data requirements. Signed-off-by: Xinyuan Tong <[email protected]>
Refactors the audio processing logic in Qwen2 to improve efficiency and readability.
- Pre-collects and stores special token IDs to avoid redundant lookups.
- Extracts and consolidates multimodal token information.
- Simplifies the data processing flow.

Signed-off-by: Xinyuan Tong <[email protected]>
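For illustration, the pre-collection pattern might look like this (the token strings and method body are assumptions; only `convert_tokens_to_ids` is a standard tokenizer call):

```python
class ExampleQwen2AudioProcessor:
    """Hypothetical sketch of the special-token pre-collection pattern."""

    def __init__(self, tokenizer) -> None:
        # Resolve special-token ids once at construction time instead of
        # re-running tokenizer lookups on every request.
        self.audio_start_id = tokenizer.convert_tokens_to_ids("<|audio_bos|>")
        self.audio_end_id = tokenizer.convert_tokens_to_ids("<|audio_eos|>")

    def strip_audio_markers(self, input_ids: list) -> list:
        # Later calls reuse the cached ids directly.
        return [i for i in input_ids
                if i not in (self.audio_start_id, self.audio_end_id)]
```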
Updates multimodal processors to correctly utilize and pass the image token ID. This change ensures that the `image_token_id` is consistently accessed from the `mm_tokens` object instead of directly from the processor, leading to more reliable and maintainable code. It also initializes image_token_id in the MultimodalSpecialTokens for processors that need it. Signed-off-by: Xinyuan Tong <[email protected]>
This update changes references from `precomputed_features` to `precomputed_embeddings` across various modules, including the VLM input format, multimodal data item, and processor classes. This refactor enhances consistency in the handling of multimodal data and improves code clarity. Signed-off-by: Xinyuan Tong <[email protected]>
This update removes the `image_sizes` attribute from the `MultimodalDataItem` class and clarifies the purpose of `precomputed_embeddings`. These changes enhance the organization of multimodal data handling and improve code clarity. Signed-off-by: Xinyuan Tong <[email protected]>
- Added `items` method to `DictOutput` class for better dictionary-like behavior. - Renamed `images` attribute to `pixel_values` in `VLChatProcessorOutput` for consistency. - Updated `DeepseekVL2ForCausalLM` to use `images_spatial_crop` instead of `image_spatial_crop`. - Modified `DeepseekVL2ImageProcessor` to include `max_req_input_len` in the processing method. - Adjusted multimodal processor calls to pass additional parameters for improved flexibility. These changes streamline the multimodal data processing and enhance the overall organization of the codebase. Signed-off-by: Xinyuan Tong <[email protected]>
/gemini review
Code Review
This pull request introduces a significant refactoring of the multimodal data processing pipeline. Centralizing model-specific data into the `model_specific_data` dictionary in `MultimodalDataItem` simplifies the data structures and improves extensibility. The review identified a high-severity issue in `MultimodalDataItem.merge`, which currently doesn't handle merging of the new `model_specific_data`, and medium-severity issues related to code duplication and a minor simplification opportunity.
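A minimal sketch of what handling the dictionary in a merge could look like (the conflict policy of collecting colliding values into a list is an assumption; the review only flags that the dictionary is skipped):

```python
from typing import Any, Dict


def merge_model_specific_data(
    target: Dict[str, Any], incoming: Dict[str, Any]
) -> None:
    """Merge `incoming` into `target` in place, keeping both values on a
    key collision by collecting them into a list (assumed policy)."""
    for key, value in incoming.items():
        if key not in target:
            target[key] = value
            continue
        existing = target[key]
        if not isinstance(existing, list):
            existing = [existing]
        existing.append(value)
        target[key] = existing
```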
Refactors how multimodal models access data within `MultimodalDataItem`.
- Adds `__getitem__`, `get`, `__setitem__`, and `set` methods to `MultimodalDataItem` for more intuitive and flexible data access.
- Updates multimodal models to use these new methods instead of directly accessing `model_specific_data`.

This change improves code readability and maintainability by providing a consistent interface for accessing data associated with multimodal items.

Signed-off-by: Xinyuan Tong <[email protected]>
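A stripped-down sketch of those four methods, with all other fields omitted:

```python
from typing import Any


class MultimodalDataItem:
    """Minimal sketch of the dict-style access methods."""

    def __init__(self) -> None:
        self.model_specific_data: dict = {}

    def __getitem__(self, key: str) -> Any:
        return self.model_specific_data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self.model_specific_data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value


item = MultimodalDataItem()
item["aspect_ratio_id"] = 3            # instead of item.model_specific_data[...]
print(item.get("aspect_ratio_mask"))   # -> None rather than a KeyError
```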
Replaces direct attribute access with the new `get` method for `aspect_ratio_id` and `aspect_ratio_mask` in the `MllamaForConditionalGeneration` class. This change enhances code readability and aligns with recent refactoring efforts in the multimodal data handling. Signed-off-by: Xinyuan Tong <[email protected]>
Commented out the TestMllamaServer class and its methods in test_vision_openai_server_a.py, indicating that Mllama is not stable for CI. This change prevents potential failures in the test suite while maintaining the code for future use. Signed-off-by: Xinyuan Tong <[email protected]>
…attr__` for model-specific data. This change improves code readability and conciseness in multimodal models like DeepseekVL2, Gemma3n, KimiVL, Llava, MiniCPMO, MiniCPMv, Mistral, Mllama, Phi4MM, Qwen2, and Qwen2Audio by allowing direct access to data attributes instead of using `item.get("key")`. It also fixes a bug in `Phi4MMProcessorAdapter` where the original `hf_key` was not being deleted from the result, leading to incorrect assignment.

Signed-off-by: Xinyuan Tong <[email protected]>
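A sketch of the attribute-fallback idea this commit describes (a `__getattr__` delegating unknown names to the per-model dictionary; the surrounding class is assumed):

```python
class MultimodalDataItem:
    """Sketch of attribute-style fallback onto the per-model dictionary."""

    def __init__(self) -> None:
        self.model_specific_data: dict = {}

    def __getattr__(self, name: str):
        # Python calls __getattr__ only after normal attribute lookup
        # fails, so real fields are unaffected. Going through __dict__
        # avoids infinite recursion if model_specific_data is not set yet.
        try:
            return self.__dict__["model_specific_data"][name]
        except KeyError:
            raise AttributeError(name) from None


item = MultimodalDataItem()
item.model_specific_data["image_sizes"] = [(448, 448)]
print(item.image_sizes)  # attribute access instead of item.get("image_sizes")
```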
====================== video_response =====================
{video_response}
===========================================================
should contain 'iPod' or 'device' or 'microphone'
Better not to hard-code this.
This pull request aims to refactor and simplify the multimodal data processing within the system. The core change involves centralizing model-specific data into a single dictionary within the MultimodalDataItem class, reducing boilerplate and improving extensibility. This refactoring impacts how various multimodal models and their processors handle and access their specific input features.