
[Refactor] simplify multimodal data processing #8107


Open · wants to merge 30 commits into main

Conversation

@JustinTong0323 (Collaborator) commented Jul 17, 2025

This pull request aims to refactor and simplify the multimodal data processing within the system. The core change involves centralizing model-specific data into a single dictionary within the MultimodalDataItem class, reducing boilerplate and improving extensibility. This refactoring impacts how various multimodal models and their processors handle and access their specific input features.

Highlights

  • Centralized Multimodal Data Storage: The MultimodalDataItem class has been refactored to consolidate model-specific attributes (e.g., image_grid_thw, audio_feature_lens) into a new model_specific_data dictionary, simplifying the data structure and making it more generic.
  • Streamlined Multimodal Processors: Multimodal processors (e.g., CLIP, Deepseek-VL, Mllama) have been updated to utilize the new model_specific_data structure, reducing redundant code and improving consistency in how data is prepared for models.
  • Model Adaptations: Numerous model implementations (e.g., Deepseek-VL2, Gemma3N, Kimi-VL, MiniCPM, Phi4MM, Qwen2.5-VL, Qwen2-Audio, Qwen2-VL) have been adjusted to retrieve their specific input features from the model_specific_data dictionary.
  • Processor Utility Enhancements: The BaseMultimodalProcessor now includes improved logic for collecting and combining multimodal data, leveraging the new centralized data structure.
  • File Renaming/Restructuring: The qwen_audio.py processor file has been moved and renamed, indicating a potential restructuring of the multimodal processor directory.
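To make the highlights concrete, here is a minimal sketch of what the refactored `MultimodalDataItem` might look like. Only `model_specific_data` is taken from the PR description; the `modality` field and the example keys are illustrative, not the actual class definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class MultimodalDataItem:
    """Hypothetical sketch: one generic dict replaces per-model attributes."""

    modality: str
    # Formerly separate attributes such as image_grid_thw or
    # audio_feature_lens now live in one generic dictionary.
    model_specific_data: Dict[str, Any] = field(default_factory=dict)


item = MultimodalDataItem(modality="image")
item.model_specific_data["image_grid_thw"] = [(1, 24, 24)]
print(item.model_specific_data["image_grid_thw"])
```

New model-specific fields can now be added without touching the class definition, which is the extensibility win the description claims.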

Checklist

This commit refactors how multimodal data is handled, promoting a more organized and maintainable structure.

- Consolidates model-specific data within a dictionary inside MultimodalDataItem, reducing code duplication and improving data organization.
- Removes the qwen_audio.py processor as part of the refactoring.

Signed-off-by: Xinyuan Tong <[email protected]>

@mickqian (Collaborator) commented:

This has quite some duplication with #7924. Can we prioritize one?

@JustinTong0323 (Collaborator, Author) replied:

> This has quite some duplication with #7924. Can we prioritize one?

Yes, we could prioritize that PR first.

@JustinTong0323 JustinTong0323 changed the title [Refactor] simplify multimodal data processing [WIP][Refactor] simplify multimodal data processing Jul 17, 2025
JustinTong0323 and others added 5 commits July 16, 2025 21:39
Signed-off-by: Xinyuan Tong <[email protected]>
…dule_batch.py and base_processor.py

Signed-off-by: Xinyuan Tong <[email protected]>
Moves model-specific data from individual `MultimodalDataItem` attributes
to a dictionary within `model_specific_data` for better organization
and maintainability. This change simplifies access to model-specific
features during processing.

Signed-off-by: Xinyuan Tong <[email protected]>

Xinyuan Tong added 3 commits July 19, 2025 00:39
Updated multiple image processor classes to use a shared `mm_tokens` attribute for multimodal token management, improving consistency and reducing redundancy in the codebase. This change enhances maintainability and simplifies the processing of multimodal data.

Signed-off-by: Xinyuan Tong <[email protected]>
Updated type hints for the `data_iterators` parameter and `all_collected_items` variable in the BaseMultimodalProcessor class to improve code clarity and type safety. This change aids in better understanding and maintainability of the multimodal processing logic.

Signed-off-by: Xinyuan Tong <[email protected]>


Xinyuan Tong added 2 commits July 19, 2025 01:12
Updated various model files to replace direct dictionary access with the get method for retrieving values from model_specific_data. This change enhances code robustness by preventing potential KeyErrors and improves consistency across the codebase.

Signed-off-by: Xinyuan Tong <[email protected]>
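The commit above swaps direct dictionary indexing for `dict.get`. A tiny self-contained illustration of why (the key name is an example from the PR description, not tied to any one model):

```python
# With many models sharing one dict, a model may look up a key that
# another model's processor never populates.
model_specific_data = {"image_grid_thw": [(1, 24, 24)]}

# Direct indexing raises KeyError for an absent key:
try:
    model_specific_data["audio_feature_lens"]
    missing = False
except KeyError:
    missing = True

# .get returns None (or a chosen default) instead of raising:
print(model_specific_data.get("audio_feature_lens"))   # None
print(model_specific_data.get("image_grid_thw"))       # [(1, 24, 24)]
```

This is what "enhances code robustness by preventing potential KeyErrors" refers to.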
This simplifies the processor interfaces and removes a parameter
that is no longer needed in the processing logic.

Signed-off-by: Xinyuan Tong <[email protected]>
@JustinTong0323 JustinTong0323 changed the title [WIP][Refactor] simplify multimodal data processing [Refactor] simplify multimodal data processing Jul 19, 2025
Xinyuan Tong added 6 commits July 19, 2025 04:48
Moves multimodal data attributes to the `model_specific_data` dictionary within `MultimodalDataItem`.

This change improves code organization and flexibility by encapsulating model-specific data, such as image sizes and attention masks, within a dedicated dictionary. It also prepares the codebase for easier extension to support diverse multimodal models with varying data requirements.

Signed-off-by: Xinyuan Tong <[email protected]>
Refactors the audio processing logic in Qwen2 to improve efficiency and readability.

- Pre-collects and stores special token IDs to avoid redundant lookups.
- Extracts and consolidates multimodal token information.
- Simplifies the data processing flow.

Signed-off-by: Xinyuan Tong <[email protected]>
Updates multimodal processors to correctly utilize and pass the image token ID.

This change ensures that the `image_token_id` is consistently accessed from the `mm_tokens` object instead of directly from the processor, leading to more reliable and maintainable code. It also initializes image_token_id in the MultimodalSpecialTokens for processors that need it.

Signed-off-by: Xinyuan Tong <[email protected]>
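A hedged sketch of the pattern this commit describes: processors read `image_token_id` from a shared `mm_tokens` object instead of each processor carrying its own copy. The `MultimodalSpecialTokens` class and `expand_image_tokens` helper below are illustrative stand-ins, not the actual SGLang definitions, and the token ID is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MultimodalSpecialTokens:
    """Hypothetical container for the tokens all processors share."""

    image_token: Optional[str] = None
    image_token_id: Optional[int] = None


mm_tokens = MultimodalSpecialTokens(image_token="<image>", image_token_id=151655)


def expand_image_tokens(
    input_ids: List[int], num_patches: int,
    tokens: MultimodalSpecialTokens = mm_tokens,
) -> List[int]:
    # Consult the shared tokens object, not a per-processor attribute.
    out: List[int] = []
    for tid in input_ids:
        out.extend([tid] * (num_patches if tid == tokens.image_token_id else 1))
    return out


print(expand_image_tokens([1, 151655, 2], 3))
```

Centralizing the ID this way is what makes access "consistent" across processors: there is exactly one place to initialize it.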
This update changes references from `precomputed_features` to `precomputed_embeddings` across various modules, including the VLM input format, multimodal data item, and processor classes. This refactor enhances consistency in the handling of multimodal data and improves code clarity.

Signed-off-by: Xinyuan Tong <[email protected]>
This update removes the `image_sizes` attribute from the `MultimodalDataItem` class and clarifies the purpose of `precomputed_embeddings`. These changes enhance the organization of multimodal data handling and improve code clarity.

Signed-off-by: Xinyuan Tong <[email protected]>
- Added `items` method to `DictOutput` class for better dictionary-like behavior.
- Renamed `images` attribute to `pixel_values` in `VLChatProcessorOutput` for consistency.
- Updated `DeepseekVL2ForCausalLM` to use `images_spatial_crop` instead of `image_spatial_crop`.
- Modified `DeepseekVL2ImageProcessor` to include `max_req_input_len` in the processing method.
- Adjusted multimodal processor calls to pass additional parameters for improved flexibility.

These changes streamline the multimodal data processing and enhance the overall organization of the codebase.

Signed-off-by: Xinyuan Tong <[email protected]>
@JustinTong0323 (Collaborator, Author) commented:

/gemini review

@gemini-code-assist (bot, Contributor) left a comment:

Code Review

This pull request introduces a significant refactoring of the multimodal data processing pipeline. Centralizing model-specific data into the model_specific_data dictionary in MultimodalDataItem simplifies the data structures and improves extensibility.

I identified a high-severity issue: MultimodalDataItem.merge does not yet merge the new model_specific_data dictionary. There are also medium-severity issues around code duplication and a minor simplification opportunity.
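The merge issue the review flags can be sketched as follows. This is a hypothetical fix under the assumption that list-valued entries (per-item tensors or shapes) should be concatenated when two items merge; the function name and merge policy are illustrative, not the actual patch.

```python
from typing import Any, Dict


def merge_model_specific_data(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Any]:
    """Combine the model_specific_data dicts of two merging items.

    List-valued fields present in both dicts are concatenated; any other
    collision is resolved by taking the right-hand value.
    """
    merged = dict(a)
    for key, value in b.items():
        if key in merged and isinstance(merged[key], list) and isinstance(value, list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


a = {"image_grid_thw": [(1, 24, 24)]}
b = {"image_grid_thw": [(1, 12, 12)], "audio_feature_lens": [80]}
print(merge_model_specific_data(a, b))
```

Without such handling, a merge would silently keep only one item's model-specific fields, which is exactly the high-severity behavior described above.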

Xinyuan Tong and others added 11 commits July 19, 2025 09:18
Refactors how multimodal models access data within `MultimodalDataItem`.

- Adds `__getitem__`, `get`, `__setitem__`, and `set` methods to `MultimodalDataItem` for more intuitive and flexible data access.
- Updates multimodal models to use these new methods instead of directly accessing `model_specific_data`.

This change improves code readability and maintainability by providing a consistent interface for accessing data associated with multimodal items.

Signed-off-by: Xinyuan Tong <[email protected]>
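The accessor methods this commit adds can be sketched as below. Only the method names (`__getitem__`, `get`, `__setitem__`, `set`) come from the commit message; the rest of the class is a minimal stand-in.

```python
from typing import Any


class MultimodalDataItem:
    """Sketch of the dict-style access protocol added in this commit."""

    def __init__(self) -> None:
        self.model_specific_data: dict = {}

    def __getitem__(self, key: str) -> Any:
        return self.model_specific_data[key]

    def __setitem__(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self.model_specific_data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self.model_specific_data[key] = value


item = MultimodalDataItem()
item["image_grid_thw"] = [(1, 24, 24)]
print(item.get("missing"))          # None instead of KeyError
print(item["image_grid_thw"])
```

Model code can then write `item["key"]` or `item.get("key")` without knowing that the values live in `model_specific_data`, which is the consistent interface the commit message describes.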
Replaces direct attribute access with the new `get` method for `aspect_ratio_id` and `aspect_ratio_mask` in the `MllamaForConditionalGeneration` class. This change enhances code readability and aligns with recent refactoring efforts in the multimodal data handling.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Commented out the TestMllamaServer class and its methods in test_vision_openai_server_a.py, indicating that Mllama is not stable for CI. This change prevents potential failures in the test suite while maintaining the code for future use.

Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
Signed-off-by: Xinyuan Tong <[email protected]>
…attr__` for model-specific data.

This change improves code readability and conciseness in multimodal models like DeepseekVL2, Gemma3n, KimiVL, Llava, MiniCPMO, MiniCPMv, Mistral, Mllama, Phi4MM, Qwen2, and Qwen2Audio by allowing direct access to data attributes instead of using `item.get("key")`. It also fixes a bug in Phi4MMProcessorAdapter where the original hf_key was not being deleted from the result, leading to incorrect assignment.

Signed-off-by: Xinyuan Tong <[email protected]>
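The `__getattr__`-based access this commit describes might look like the sketch below, which lets models write `item.image_grid_thw` instead of `item.get("image_grid_thw")`. This is an illustrative stand-in, not the actual implementation.

```python
from typing import Any


class MultimodalDataItem:
    """Sketch: attribute access falls through to model_specific_data."""

    def __init__(self, **kwargs: Any) -> None:
        # Assign through __dict__ so this line does not itself trigger
        # __getattr__ (which would recurse before the dict exists).
        self.__dict__["model_specific_data"] = dict(kwargs)

    def __getattr__(self, name: str) -> Any:
        # Invoked only when normal attribute lookup fails.
        try:
            return self.model_specific_data[name]
        except KeyError:
            raise AttributeError(name)


item = MultimodalDataItem(image_grid_thw=[(1, 24, 24)])
print(item.image_grid_thw)
```

Because `__getattr__` is only called after normal lookup fails, real attributes such as `model_specific_data` itself are unaffected.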
A collaborator commented on this test snippet:

    ====================== video_response =====================
    {video_response}
    ===========================================================
    should contain 'iPod' or 'device' or 'microphone'

Better not hard-code this.

@JustinTong0323 JustinTong0323 added high priority MLLM multi-modal language model labels Jul 20, 2025