
Candle-Based Inferencing #4361

@sempervictus

Description


Please describe the feature you want

Candle-vllm, mistralrs, or candle-based primitives for model handling would provide tighter coupling and possibly better performance. At present, candle-vllm can sustain ~55 T/s on a q8_0 Qwen3-Coder even on compute-capability-7.0 (Volta-generation) hardware with a 512k context, and it stays fairly stable into the ~400k range thanks to how it handles ISQ and attention. llama.cpp, by contrast, reaches only a fraction of that throughput and seems to lose track of earlier content deep into large context windows.
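For reference, a minimal sketch of what "candle-based primitives" means in practice, assuming the candle-core crate. Device selection and the tensor ops follow candle's public API; the shapes and the standalone main are purely illustrative and not a proposal for how the integration would actually be wired in:

```rust
// Minimal sketch, assuming the candle-core crate (device/tensor names follow
// its public API; shapes and the standalone main are illustrative only).
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    // Prefer CUDA when candle is built with the cuda feature, else fall back to CPU.
    let device = Device::cuda_if_available(0)?;

    // Stand-in for one step of a model forward pass: hidden state times weight matrix.
    let hidden = Tensor::randn(0f32, 1f32, (1, 4096), &device)?;
    let weight = Tensor::randn(0f32, 1f32, (4096, 4096), &device)?;
    let logits = hidden.matmul(&weight)?;

    println!("output shape: {:?}", logits.dims());
    Ok(())
}
```

The point of the sketch is the coupling: with candle the weights, KV cache, and sampling loop live in-process and on the device the host controls, rather than behind an external llama.cpp process.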

Additional context


Please reply with a 👍 if you want this feature.
