@milesial milesial commented Oct 15, 2025

(WIP)

Overview:

Media decoding in the frontend for VLMs.

Details:

Decodes multimodal data (image_url, video_url) from the OAI chat request in the frontend processor into tensors (pixel values).
Passes the decoded data to the next step in the graph (backend) via NIXL readable descriptors.

Decoding data involves:

  • Potentially fetching the data from the web
  • Potentially decoding base64
  • Running the actual media decoding (JPEG, H264, ...)

These last two steps can be CPU-heavy and are done in the rayon runtime.
This decoding is optional: if Dynamo was not built with this feature, or if no decoding configuration is passed, the unprocessed URLs are passed through as-is.
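The first two steps above hinge on telling inline payloads apart from remote URLs. A minimal sketch of that classification step follows; the names (MediaSource, classify) are illustrative only and not the PR's actual API, and non-base64 data URLs are handled naively for brevity.

```rust
/// Hypothetical classification of an OAI-style media URL into the two
/// fetch paths: remote HTTP fetch vs. inline base64 decode.
enum MediaSource {
    /// Remote URL that must first be fetched over HTTP.
    Remote(String),
    /// Inline base64 payload carried in a `data:` URL.
    Inline { mime: String, b64: String },
}

fn classify(url: &str) -> MediaSource {
    if let Some(rest) = url.strip_prefix("data:") {
        // e.g. "data:image/jpeg;base64,/9j/4AAQ..."
        // Simplification: a data URL without ";base64," falls through
        // with a default MIME type instead of being percent-decoded.
        let (mime, b64) = rest
            .split_once(";base64,")
            .unwrap_or(("application/octet-stream", rest));
        MediaSource::Inline {
            mime: mime.to_string(),
            b64: b64.to_string(),
        }
    } else {
        MediaSource::Remote(url.to_string())
    }
}
```

In the real flow, the Remote arm would go through the MediaLoader's HTTP client, and both arms would then hand raw bytes to the CPU-heavy decoder on the rayon runtime.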

The preprocessor holds a MediaLoader, which bundles an HTTP client and a media decoder for each modality. Decoder configuration is passed via the MDC; in the future, per-request or even per-item options could override this default configuration.
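The override layering described above could look like the following sketch. Field and type names here are assumptions for illustration, not the actual MDC schema.

```rust
/// Effective decoder settings, defaulted from the MDC.
#[derive(Clone, Debug)]
struct DecoderConfig {
    max_image_pixels: u64,
    max_video_frames: u32,
}

/// Hypothetical per-request overrides: `None` means "keep the MDC default".
#[derive(Default)]
struct DecoderOverrides {
    max_image_pixels: Option<u64>,
    max_video_frames: Option<u32>,
}

impl DecoderConfig {
    /// Layer request-level options over the defaults from the MDC.
    fn with_overrides(&self, o: &DecoderOverrides) -> DecoderConfig {
        DecoderConfig {
            max_image_pixels: o.max_image_pixels.unwrap_or(self.max_image_pixels),
            max_video_frames: o.max_video_frames.unwrap_or(self.max_video_frames),
        }
    }
}
```

Using `Option` fields for overrides keeps "unset" distinct from any real value, so a request can override one knob without restating the rest.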

TODOs:

  • Gate the media decoding code behind a feature flag
  • NIXL descriptors
  • Unit tests
  • Microbench tests
  • Per-request decoder options
  • HW decoding

Where should the reviewer start?

Flow starting from gather_multi_modal_data in preprocessor.rs

copy-pr-bot bot commented Oct 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

