Description
I would like to add a Mixture of Modality Experts (MoME) block to nanoVLM to enable dynamic, modality-specific routing between a small vision expert and a small text expert at each cross-modal layer. The goal is to introduce a lightweight MoME architecture that lets nanoVLM learn richer modality-specific representations.
Motivation
nanoVLM currently relies on a single shared transformer to process combined image and text inputs. Prior work shows that splitting modality processing into separate "experts" and routing between them can:
- Reduce cross-modal interference
- Enable modality-aware feature learning
- Provide interpretability via gating weights
Integrating a lightweight Mixture-of-Modality-Experts (MoME) block into nanoVLM will let it learn to “soft switch” between a vision expert and a text expert per input, improving robustness without significantly increasing parameters.
Potential benefits
- Reduced Modality Interference: Routing vision inputs through a Vision Expert and text inputs through a Text Expert prevents a single expert from juggling both modalities (MoME).
- Modality-Aware Routing: A small MLP router decides how much weight to give each expert for every input (image+text pair) by concatenating the pooled vision and text embeddings and applying a softmax; each expert's output vector is then scaled by its weight (see the sketch after this list).
- Interpretable Gating: Logging the router weights shows whether an input was mostly handled by the Vision Expert or the Text Expert. I²MoE suggests lightweight gating losses to encourage the router to use both experts in a meaningful way rather than collapse onto one.
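A minimal sketch of the router described above, assuming pooled vision and text embeddings of a shared hidden size. The module name `MoMERouter` and its constructor arguments are illustrative placeholders, not existing nanoVLM code.

```python
import torch
import torch.nn as nn


class MoMERouter(nn.Module):
    """Hypothetical MLP router: pooled vision + text embeddings -> softmax weights over experts."""

    def __init__(self, hidden_dim: int, num_experts: int = 2):
        super().__init__()
        # Concatenated pooled embeddings (vision + text) go through a small MLP.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_experts),
        )

    def forward(self, vision_pooled: torch.Tensor, text_pooled: torch.Tensor) -> torch.Tensor:
        # vision_pooled, text_pooled: (batch, hidden_dim)
        gate_logits = self.mlp(torch.cat([vision_pooled, text_pooled], dim=-1))
        # Softmax yields per-sample expert weights, e.g. [0.7, 0.3] -> mostly the Vision Expert.
        return torch.softmax(gate_logits, dim=-1)  # (batch, num_experts)
```

These weights are exactly what would be logged for the interpretability point above.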
Workflow
When MoME is enabled in nanoVLM, the model sends the image through a small “vision expert” network and the text through a separate “text expert” network. Each expert distills its input into a representation of what it sees or reads. A tiny decision maker (the “router”) then looks at both summaries, weighs them, and blends them into one final representation. This representation goes into the usual nanoVLM head to make predictions. During training, the model learns both how to blend the representations and when to lean on vision versus text; a gating loss penalizes the model if it leans too heavily on one side.
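A sketch of this workflow under the same assumptions as the router sketch above. `MoMEBlock`, `vision_expert`, and `text_expert` are hypothetical names used for illustration; the real integration points in nanoVLM may differ.

```python
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    """Hypothetical MoME block: two small experts plus a per-sample router."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Small per-modality experts; in a real integration these could be
        # the FFN sub-layers of a transformer block rather than plain MLPs.
        self.vision_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.text_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Router over the two experts (same idea as the MoMERouter sketch above).
        self.router = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 2)
        )

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # vision_tokens: (batch, num_patches, hidden_dim)
        # text_tokens:   (batch, seq_len, hidden_dim)
        v_summary = self.vision_expert(vision_tokens).mean(dim=1)  # pooled vision summary
        t_summary = self.text_expert(text_tokens).mean(dim=1)      # pooled text summary
        weights = torch.softmax(
            self.router(torch.cat([v_summary, t_summary], dim=-1)), dim=-1
        )  # (batch, 2)
        # Blend the two summaries with the per-sample gate weights.
        fused = weights[:, 0:1] * v_summary + weights[:, 1:2] * t_summary
        return fused, weights  # weights can be logged and regularized
```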
Proposed Addition
- In "models/config":
  a. Add a boolean flag `'use_mome': True`
  b. Number of text and fusion experts
  c. Hidden dimension
  d. Number of layers
- In "models/vision_transformer.py":
  a. Number of vision experts
  b. Return the output of each vision expert
- In "models/vision_language_model.py", in class "VisionLanguageModel":
  a. Instantiation and implementation of the vision encoder and decoder
  b. Embedding of text tokens during encoding
  c. Loss computation
- New file "models/mome_utils.py" (see the sketch after this list):
  a. Normalized weights for the experts
  b. Balancing loss to keep expert usage balanced
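A possible shape for "models/mome_utils.py", assuming the router weights from the sketches above. The balancing term here simply pushes the batch-averaged expert usage towards uniform, which is one common formulation and not necessarily the exact I²MoE gating loss.

```python
import torch


def normalize_expert_weights(gate_logits: torch.Tensor) -> torch.Tensor:
    """Turn raw router logits into normalized expert weights (softmax over the expert dim)."""
    return torch.softmax(gate_logits, dim=-1)


def balancing_loss(expert_weights: torch.Tensor) -> torch.Tensor:
    """Penalize the router for collapsing onto a single expert.

    expert_weights: (batch, num_experts) softmax outputs of the router.
    The batch-averaged usage of each expert is pushed towards the uniform
    distribution 1 / num_experts.
    """
    num_experts = expert_weights.shape[-1]
    mean_usage = expert_weights.mean(dim=0)                   # (num_experts,)
    uniform = torch.full_like(mean_usage, 1.0 / num_experts)
    return ((mean_usage - uniform) ** 2).sum()
```

During training this term would be added to the usual nanoVLM loss with a small coefficient, e.g. `loss = task_loss + 0.01 * balancing_loss(weights)`; the coefficient value is only illustrative.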
References:
- VLMo: Introduces a Mixture-of-Modality-Experts (MoME) Transformer that can encode various modalities (images, text, and image-text pairs) within a Transformer block.
- Uni-MoE: Introduces a separate encoder for each modality and maps each modality into a unified representation space.
- MoME: Introduces the idea of different experts for each modality.
- I²MoE: Proposes adding a small regularization term (gating loss) on the router outputs to prevent one expert from dominating and to ensure both experts stay useful. Introduces a multilayer perceptron (MLP) that assigns importance scores to each expert's output.