
Candle-Based Inferencing #4361

@sempervictus

Description


Please describe the feature you want

Candle-vllm, mistralrs, or candle-based primitives for model handling would provide tighter coupling and possibly better performance. At present, candle-vllm can sustain ~55 T/s on a q8_0 Qwen3-Coder even on compute-capability-7.0 (Volta-generation) hardware with a 512k context, and it stays fairly stable into the ~400k range thanks to how it handles ISQ and attention. llama.cpp, by contrast, reaches only a fraction of that throughput and seems to lose track of earlier content deep into large context windows.
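For reference, a minimal sketch of what "candle-based primitives" means in practice, assuming the candle-core crate. Device selection and the tensor ops follow candle's public API; the shapes and the standalone main are purely illustrative and not a proposal for how the integration would actually be wired in:

```rust
// Minimal sketch, assuming the candle-core crate (device/tensor names follow
// its public API; shapes and the standalone main are illustrative only).
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Prefer CUDA when candle is built with the cuda feature, else fall back to CPU.
    let device = Device::cuda_if_available(0)?;

    // Stand-in for one step of a model forward pass: hidden state times weight matrix.
    let hidden = Tensor::randn(0f32, 1f32, (1, 4096), &device)?;
    let weight = Tensor::randn(0f32, 1f32, (4096, 4096), &device)?;
    let logits = hidden.matmul(&weight)?;

    println!("output shape: {:?}", logits.dims());
    Ok(())
}
```

The point of the sketch is the coupling: with candle the weights, KV cache, and sampling loop live in-process and on the device the host controls, rather than behind an external llama.cpp process.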

Additional context


Please reply with a 👍 if you want this feature.
