Description
I would like to add a Mixture of Modality Experts (MoME) block to nanoVLM to enable dynamic, modality-specific routing between a small vision expert and a small text expert at each cross-modal layer. The goal is to introduce a lightweight MoME architecture that lets nanoVLM learn richer modality-specific representations.
Motivation
nanoVLM currently relies on a single shared transformer to process combined image and text inputs. Prior work shows that splitting modality processing into separate "experts" and routing between them can:
- Reduce cross-modal interference
- Enable modality-aware feature learning
- Provide interpretability via gating weights
Integrating a lightweight Mixture-of-Modality-Experts (MoME) block into nanoVLM will let it learn to “soft switch” between a vision expert and a text expert per input, improving robustness without significantly increasing parameters.
Potential benefits
- Reduced Modality Interference: Routing vision inputs through a Vision Expert and text inputs through a Text Expert prevents a single expert from juggling both modalities (MoME).
- Modality-Aware Routing: A small MLP router decides how much weight to give each expert for every input (image+text pair) by concatenating the pooled vision and text embeddings and applying a softmax; each expert's output vector is then scaled by its weight (see the sketch after this list).
- Interpretable Gating: Logging the router weights shows whether an input was mostly handled by the Vision Expert or the Text Expert. I²MoE suggests lightweight gating losses to encourage the router to use both experts in a meaningful way rather than collapse onto one.
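A minimal sketch of the router described above, assuming pooled vision and text embeddings of a shared hidden size. The module name `MoMERouter` and its constructor arguments are illustrative placeholders, not existing nanoVLM code.

```python
import torch
import torch.nn as nn


class MoMERouter(nn.Module):
    """Hypothetical MLP router: pooled vision + text embeddings -> softmax weights over experts."""

    def __init__(self, hidden_dim: int, num_experts: int = 2):
        super().__init__()
        # Concatenated pooled embeddings (vision + text) go through a small MLP.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_experts),
        )

    def forward(self, vision_pooled: torch.Tensor, text_pooled: torch.Tensor) -> torch.Tensor:
        # vision_pooled, text_pooled: (batch, hidden_dim)
        gate_logits = self.mlp(torch.cat([vision_pooled, text_pooled], dim=-1))
        # Softmax yields per-sample expert weights, e.g. [0.7, 0.3] -> mostly the Vision Expert.
        return torch.softmax(gate_logits, dim=-1)  # (batch, num_experts)
```

These weights are exactly what would be logged for the interpretability point above.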
Workflow
When MoME is enabled in nanoVLM, the model sends the image through a small “vision expert” network and the text through a separate “text expert” network. Each expert distills its input into a representation of what it sees or reads. A tiny decision maker (the “router”) then looks at both summaries, weighs them, and blends them into one final representation. This representation goes into the usual nanoVLM head to make predictions. During training, the model learns both how to blend the representations and when to lean on vision versus text; a gating loss penalizes the model if it leans too heavily on one side.
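A sketch of this workflow under the same assumptions as the router sketch above. `MoMEBlock`, `vision_expert`, and `text_expert` are hypothetical names used for illustration; the real integration points in nanoVLM may differ.

```python
import torch
import torch.nn as nn


class MoMEBlock(nn.Module):
    """Hypothetical MoME block: two small experts plus a per-sample router."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Small per-modality experts; in a real integration these could be
        # the FFN sub-layers of a transformer block rather than plain MLPs.
        self.vision_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        self.text_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Router over the two experts (same idea as the MoMERouter sketch above).
        self.router = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 2)
        )

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # vision_tokens: (batch, num_patches, hidden_dim)
        # text_tokens:   (batch, seq_len, hidden_dim)
        v_summary = self.vision_expert(vision_tokens).mean(dim=1)  # pooled vision summary
        t_summary = self.text_expert(text_tokens).mean(dim=1)      # pooled text summary
        weights = torch.softmax(
            self.router(torch.cat([v_summary, t_summary], dim=-1)), dim=-1
        )  # (batch, 2)
        # Blend the two summaries with the per-sample gate weights.
        fused = weights[:, 0:1] * v_summary + weights[:, 1:2] * t_summary
        return fused, weights  # weights can be logged and regularized
```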
Proposed Addition
- In "models/config":
  a. Add a boolean flag `'use_mome': True`
  b. Number of text and fusion experts
  c. Hidden dimension
  d. Number of layers
- In "models/vision_transformer.py":
  a. Number of vision experts
  b. Return the output of each vision expert
- In "models/vision_language_model.py", in class "VisionLanguageModel":
  a. Instantiation and implementation of the vision encoder and decoder
  b. Embedding of text tokens during encoding
  c. Loss computation
- New file "models/mome_utils.py" (see the sketch after this list):
  a. Normalized weights for the experts
  b. Balancing loss to keep expert usage balanced
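A possible shape for "models/mome_utils.py", assuming the router weights from the sketches above. The balancing term here simply pushes the batch-averaged expert usage towards uniform, which is one common formulation and not necessarily the exact I²MoE gating loss.

```python
import torch


def normalize_expert_weights(gate_logits: torch.Tensor) -> torch.Tensor:
    """Turn raw router logits into normalized expert weights (softmax over the expert dim)."""
    return torch.softmax(gate_logits, dim=-1)


def balancing_loss(expert_weights: torch.Tensor) -> torch.Tensor:
    """Penalize the router for collapsing onto a single expert.

    expert_weights: (batch, num_experts) softmax outputs of the router.
    The batch-averaged usage of each expert is pushed towards the uniform
    distribution 1 / num_experts.
    """
    num_experts = expert_weights.shape[-1]
    mean_usage = expert_weights.mean(dim=0)                   # (num_experts,)
    uniform = torch.full_like(mean_usage, 1.0 / num_experts)
    return ((mean_usage - uniform) ** 2).sum()
```

During training this term would be added to the usual nanoVLM loss with a small coefficient, e.g. `loss = task_loss + 0.01 * balancing_loss(weights)`; the coefficient value is only illustrative.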
References:
- VLMo: Introduces a Mixture-of-Modality-Experts (MoME) Transformer that can encode various modalities (images, text, and image-text pairs) within a Transformer block.
- Uni-MoE: Introduces a separate encoder for each modality and maps each modality into a unified representation space.
- MoME: Introduces the idea of different experts for each modality.
- I²MoE: Proposes adding a small regularization term (gating loss) on the router outputs to prevent one expert from dominating and to ensure both experts stay useful. Introduces a multilayer perceptron (MLP) that assigns importance scores to each expert's output.