Add Mistral3 vision-language model support (For Flux2 Migration)#3246

Open
SpenserCai wants to merge 13 commits into huggingface:main from SpenserCai:mistralai3_support

Conversation

@SpenserCai
Contributor

@SpenserCai SpenserCai commented Dec 16, 2025

Summary

This PR adds support for the Mistral3 (Mistral-Small-3.x) vision-language model to candle-transformers. Mistral3 combines the Pixtral vision encoder with the Mistral language model, enabling multimodal image-text understanding.

Note: This PR is a preparatory step for the upcoming Flux2 model migration, as Flux2 shares similar multimodal architecture patterns with Mistral3.

Changes

New files in candle-transformers/src/models/mistral3/:

  • mod.rs - Module exports and documentation
  • config.rs - Mistral3Config with vision, text, and projector settings
  • model.rs - Mistral3Model and Mistral3ForConditionalGeneration
  • patch_merger.rs - PatchMerger for reducing image tokens
  • projector.rs - MultiModalProjector (RMSNorm + PatchMerger + MLP)

Modified files:

  • candle-transformers/src/models/mod.rs - Added mistral3 module export
  • candle-transformers/src/models/pixtral/vision_model.rs - Added forward_with_hidden_states() and VisionModelOutput struct
  • candle-transformers/src/models/mistral.rs - Added forward_embeds_hidden() for multimodal integration

Architecture

Mistral3ForConditionalGeneration
├── Mistral3Model
│   ├── vision_tower (Pixtral Vision Model, 24 layers)
│   ├── multi_modal_projector
│   │   ├── norm (RMSNorm)
│   │   ├── patch_merger (spatial_merge_size=2, reduces tokens by 4x)
│   │   ├── linear_1
│   │   ├── act (GELU)
│   │   └── linear_2
│   └── language_model (Mistral, 40 layers)
└── lm_head
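
The tree above ends with projected image features being spliced back into the text embedding sequence before the language model runs. A minimal sketch of that step in plain Rust (hypothetical names and a simplified Vec-of-Vec layout; the real implementation operates on candle Tensors):

```rust
// Illustrative sketch only: splice image embeddings into the text embedding
// sequence at the positions of the image placeholder token, mimicking
// PyTorch's masked_scatter. `image_token_id` and the Vec-of-Vec embedding
// layout are simplifications for this example.
fn replace_image_tokens(
    embeds: &mut [Vec<f32>],   // one embedding vector per input token
    input_ids: &[u32],         // token ids, some equal to image_token_id
    image_token_id: u32,
    image_embeds: &[Vec<f32>], // projected vision features, in order
) {
    let mut next = image_embeds.iter();
    for (pos, &id) in input_ids.iter().enumerate() {
        if id == image_token_id {
            let img = next.next().expect("fewer image embeddings than image tokens");
            embeds[pos] = img.clone();
        }
    }
}
```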

Key Implementation Details

  1. PatchMerger: Uses reshape + permute to implement PyTorch's unfold operation (kernel_size == stride, no overlap), merging 2x2 patches into one.

  2. Image Token Replacement: Implements replace_image_tokens() as Candle equivalent of PyTorch's masked_scatter.

  3. Vision Tower Integration: Uses forward_with_hidden_states() to get batch-dimension-preserved output matching PyTorch Transformers behavior.
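
Point 1 above can be sketched in plain Rust on a flat row-major buffer (hypothetical helper, not the candle Tensor code in the PR): merging each non-overlapping 2x2 block of patch embeddings into one concatenated vector, which is exactly what unfold with kernel_size == stride produces.

```rust
// Illustrative sketch: merge 2x2 patch blocks, mimicking PyTorch's unfold
// with kernel_size == stride (no overlap). A (h, w, d) grid of patch
// embeddings stored row-major as a flat slice becomes (h/2 * w/2) rows of
// length 4*d: each output row concatenates one 2x2 block of neighbours.
fn merge_2x2_patches(grid: &[f32], h: usize, w: usize, d: usize) -> Vec<f32> {
    assert_eq!(grid.len(), h * w * d);
    assert!(h % 2 == 0 && w % 2 == 0);
    let mut out = Vec::with_capacity(grid.len());
    for by in 0..h / 2 {
        for bx in 0..w / 2 {
            // Concatenate the 4 patches of the block in row-major order.
            for dy in 0..2 {
                for dx in 0..2 {
                    let y = 2 * by + dy;
                    let x = 2 * bx + dx;
                    let start = (y * w + x) * d;
                    out.extend_from_slice(&grid[start..start + d]);
                }
            }
        }
    }
    out
}
```

The candle version expresses the same index permutation with reshape and permute instead of explicit loops, so no data-dependent gather is needed.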

Supported Models

Differences from Pixtral LLaVA

| Feature | Pixtral LLaVA | Mistral3 |
|---|---|---|
| PatchMerger | ❌ | ✅ (spatial_merge_size=2) |
| Projector RMSNorm | ❌ | ✅ |
| Projector bias | ✅ | ❌ |
| Image token reduction | 1x | 4x |
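
As a rough sketch of where the 4x in the table comes from (the patch and image sizes below are assumed values for illustration, not read from the config): spatial_merge_size = 2 merges each 2x2 block of patches into one token, shrinking the patch grid by merge_size².

```rust
// Hypothetical token-count arithmetic: an image is cut into a grid of
// patch_size x patch_size patches, then PatchMerger collapses each
// merge_size x merge_size block into a single token.
fn image_token_count(height: usize, width: usize, patch_size: usize, merge_size: usize) -> usize {
    let grid_h = height / patch_size;
    let grid_w = width / patch_size;
    (grid_h / merge_size) * (grid_w / merge_size)
}
```

For a 512x512 image with an assumed patch_size of 16: 32x32 = 1024 patches without merging, 16x16 = 256 tokens with spatial_merge_size = 2, i.e. a 4x reduction.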

Usage

use candle_transformers::models::mistral3::{Mistral3Config, Mistral3ForConditionalGeneration};

let config: Mistral3Config = serde_json::from_str(&config_str)?;
let model = Mistral3ForConditionalGeneration::new(&config, vb)?;
let logits = model.forward(&input_ids, Some(&pixel_values), Some(&image_sizes), 0)?;

Verification

The implementation has been verified against PyTorch Transformers reference:

  • Vision Tower: avg_diff = 2.29e-4
  • MultiModal Projector: avg_diff = 3.61e-8
  • Full Forward Pass: Predicted token matches (token ID: 1784 "The")

Checklist

  • New model implementation follows existing patterns in candle-transformers
  • Configuration uses serde for JSON deserialization
  • Reuses existing components (Pixtral vision, Mistral language model)
  • Documentation comments included
  • Verified against PyTorch reference implementation

@SpenserCai
Contributor Author


Mistral3 examples added!

@SpenserCai
Contributor Author

Fixed clippy and fmt.

@ivarflakstad
Member

Is 24B the smallest model for this scenario? It's pretty slow on Metal.

@SpenserCai
Contributor Author

> Is 24B the smallest model for this scenario? It's pretty slow on Metal.

Yes, the quantized version is out, and that is the smallest model available.

@SpenserCai
Contributor Author

> Is 24B the smallest model for this scenario? It's pretty slow on Metal.
>
> Yes, the quantized version is out, and that is the smallest model available.

@ivarflakstad Is there anything I need to change about this PR? 🤔

@ivarflakstad
Member

Not sure. Focusing on stabilizing the next release.
The example seems a bit slow to me, but it could be the model size. I haven't dug deeply into the implementation.

