Add support for SmallThinker model series #14898

Open
wants to merge 18 commits into master

Conversation

@wdl339 wdl339 commented Jul 27, 2025

Purpose

SmallThinker is a family of on-device, native Mixture-of-Experts (MoE) language models designed for local deployment, co-developed by IPADS (the team behind the high-speed inference framework PowerInfer) and the School of AI at Shanghai Jiao Tong University, together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to personal devices, without relying on the cloud.

This PR adds support for the SmallThinker series of models to llama.cpp.

Modifications

  • Add support for SmallthinkerForCausalLM model conversion in convert-hf-to-gguf.py.
  • Add the new LLM_ARCH_SMALLTHINKER architecture.
  • Add inference support for models based on LLM_ARCH_SMALLTHINKER.
  • Implement a new function, build_moe_ffn_from_probs, to handle SmallThinker's architecture, in which the MoE router is positioned before the attention block.
  • Implement a new function, set_dense_start_swa_pattern. The existing set_swa_pattern function enables a pattern where every Nth layer is dense, with the count starting from SWA layers; the new function allows the pattern to start with a dense layer (a standalone sketch of the two patterns follows this list).
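
As a rough illustration of the difference, here is a minimal, self-contained sketch of the two layer patterns (this is not the llama.cpp implementation; the function names, the n_pattern parameter, and the exact indexing convention are assumptions based on the descriptions above):

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the existing behaviour: every n_pattern-th layer is dense,
// with the count starting from SWA layers (layer 0 uses SWA).
static std::vector<bool> swa_pattern(uint32_t n_layer, uint32_t n_pattern) {
    std::vector<bool> is_swa(n_layer, false);
    for (uint32_t il = 0; il < n_layer; ++il) {
        is_swa[il] = n_pattern != 0 && (il % n_pattern) < (n_pattern - 1);
    }
    return is_swa;
}

// Sketch of the dense-start variant: the pattern begins with a dense layer
// (layer 0 is dense) and the SWA layers follow.
static std::vector<bool> dense_start_swa_pattern(uint32_t n_layer, uint32_t n_pattern) {
    std::vector<bool> is_swa(n_layer, false);
    for (uint32_t il = 0; il < n_layer; ++il) {
        is_swa[il] = n_pattern != 0 && (il % n_pattern) != 0;
    }
    return is_swa;
}

int main() {
    // With n_pattern = 4 over 8 layers:
    //   swa_pattern:             SWA SWA SWA dense SWA SWA SWA dense
    //   dense_start_swa_pattern: dense SWA SWA SWA dense SWA SWA SWA
    for (bool is_swa : dense_start_swa_pattern(8, 4)) {
        printf("%s ", is_swa ? "SWA" : "dense");
    }
    printf("\n");
    return 0;
}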

Testing

Clone the model from https://huggingface.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and use convert-hf-to-gguf.py to convert it to GGUF format.
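
For example, the conversion step might look like this (the model path and the output filename are placeholders, and the --outfile/--outtype flags are assumed from the script's usual usage):

python convert-hf-to-gguf.py /path/to/SmallThinker-4BA0.6B-Instruct --outfile smallthinker-4b.gguf --outtype f16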

./build/bin/llama-cli -m /mnt/m2_4/wdl/smallthinker-4b.gguf -p "The meaning of life is" -n 64

The meaning of life is a profound and deeply personal question! There is no single answer, and different perspectives offer varying insights. Here are some major approaches to understanding it:

1. **Existential Meaning-Making**  
   Philosophers like Sartre argue life has no inherent meaning—*we create our own purpose

Full output:
./build/bin/llama-cli -m /mnt/m2_4/wdl/smallthinker-4b.gguf -p "The meaning of life is" -n 64
build: 6006 (92b518b4) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 31 key-value pairs and 323 tensors from /mnt/m2_4/wdl/smallthinker-4b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smallthinker
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = 4b_v7
llama_model_loader: - kv   3:                           general.finetune str              = 4b_v7
llama_model_loader: - kv   4:                         general.size_label str              = 32x758M
llama_model_loader: - kv   5:                   smallthinker.block_count u32              = 32
llama_model_loader: - kv   6:                smallthinker.context_length u32              = 32768
llama_model_loader: - kv   7:              smallthinker.embedding_length u32              = 1536
llama_model_loader: - kv   8:          smallthinker.attention.head_count u32              = 12
llama_model_loader: - kv   9:       smallthinker.attention.head_count_kv u32              = 2
llama_model_loader: - kv  10:                smallthinker.rope.freq_base f32              = 1500000.000000
llama_model_loader: - kv  11: smallthinker.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:          smallthinker.attention.key_length u32              = 128
llama_model_loader: - kv  13:        smallthinker.attention.value_length u32              = 128
llama_model_loader: - kv  14:                          general.file_type u32              = 1
llama_model_loader: - kv  15:                  smallthinker.expert_count u32              = 32
llama_model_loader: - kv  16:             smallthinker.expert_used_count u32              = 4
llama_model_loader: - kv  17:    smallthinker.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  18:           smallthinker.feed_forward_length u32              = 768
llama_model_loader: - kv  19:            smallthinker.expert_gating_func u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type  f16:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 7.95 GiB (16.01 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = smallthinker
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1536
print_info: n_layer          = 32
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 768
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 4.27 B
print_info: general.name     = 4b_v7
print_info: n_ff_exp         = 768
print_info: expert_gating_func   = sigmoid
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  8144.63 MiB
............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = true
llama_context: freq_base     = 1500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:        CPU KV buffer size =   128.00 MiB
llama_kv_cache_unified: size =  128.00 MiB (  4096 cells,  32 layers,  1/ 1 seqs), K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:        CPU compute buffer size =   299.75 MiB
llama_context: graph nodes  = 1799
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 32068148
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
The meaning of life is
assistant
The meaning of life is a profound and deeply personal question! There is no single answer, and different perspectives offer varying insights. Here are some major approaches to understanding it:

1. **Existential Meaning-Making**  
   Philosophers like Sartre argue life has no inherent meaning—*we create our own purpose

@github-actions github-actions bot added the python python script changes label Jul 27, 2025
@wdl339 wdl339 marked this pull request as ready for review July 27, 2025 17:38
@@ -938,6 +938,100 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
return moe_out;
}

ggml_tensor * llm_graph_context::build_moe_ffn_from_probs(
Collaborator

The code duplication is unfortunate; is it possible to merge this into build_moe_ffn with probs as a toggle, without making too much of a mess?

Can be a follow-up.

Author

@wdl339 wdl339 Jul 28, 2025

That's a great point. I've been thinking about the best way to merge these and have a couple of ideas on how we could approach it.

  1. As you suggested, we could modify build_moe_ffn to accept an optional probs parameter. The main difficulty here is that the logic for weight normalization and activation functions diverges significantly between the two paths, so it would require some careful internal branching to keep it clean.
  2. Alternatively, we could extract the initial router logic (logits and probs calculation) into its own function. build_moe_ffn would then have a check at the beginning to decide whether to call this new router function. My main concern with this approach is that build_moe_ffn is a core function, and I'm a bit worried about affecting other models, so this would need careful testing.

Both approaches seem feasible. Given the complexity and your suggestion that this can be a follow-up, would you prefer I handle this in a separate PR, or should I proceed with one of these solutions here?

Collaborator

A separate PR is probably best.

wdl339 and others added 5 commits July 28, 2025 08:37