Add support for SmallThinker model series #14898

Open
wants to merge 18 commits into master

Conversation

@wdl339 wdl339 commented Jul 27, 2025

Purpose

SmallThinker is a family of on-device, native Mixture-of-Experts (MoE) language models designed for local deployment, co-developed by IPADS (the team behind the high-speed inference framework PowerInfer) and the School of AI at Shanghai Jiao Tong University, together with Zenergize AI. Designed from the ground up for resource-constrained environments, SmallThinker brings powerful, private, and low-latency AI directly to personal devices, without relying on the cloud.

This PR adds support for the SmallThinker series of models to llama.cpp.

Modifications

  • Add support for SmallthinkerForCausalLM model conversion in convert-hf-to-gguf.py.
  • Add the new LLM_ARCH_SMALLTHINKER architecture.
  • Add inference support for models based on LLM_ARCH_SMALLTHINKER.
  • Implement a new function, build_moe_ffn_from_probs, to handle SmallThinker's architecture, in which the MoE router is positioned before the attention block.
  • Implement a new function, set_dense_start_swa_pattern. The existing set_swa_pattern function enables a pattern where every Nth layer is dense, with the count starting from SWA layers; the new function allows the pattern to start with a dense layer (a standalone sketch of the two patterns follows this list).
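
As a rough illustration of the difference, here is a minimal, self-contained sketch of the two layer patterns (this is not the llama.cpp implementation; the function names, the n_pattern parameter, and the exact indexing convention are assumptions based on the descriptions above):

#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the existing behaviour: every n_pattern-th layer is dense,
// with the count starting from SWA layers (layer 0 uses SWA).
static std::vector<bool> swa_pattern(uint32_t n_layer, uint32_t n_pattern) {
    std::vector<bool> is_swa(n_layer, false);
    for (uint32_t il = 0; il < n_layer; ++il) {
        is_swa[il] = n_pattern != 0 && (il % n_pattern) < (n_pattern - 1);
    }
    return is_swa;
}

// Sketch of the dense-start variant: the pattern begins with a dense layer
// (layer 0 is dense) and the SWA layers follow.
static std::vector<bool> dense_start_swa_pattern(uint32_t n_layer, uint32_t n_pattern) {
    std::vector<bool> is_swa(n_layer, false);
    for (uint32_t il = 0; il < n_layer; ++il) {
        is_swa[il] = n_pattern != 0 && (il % n_pattern) != 0;
    }
    return is_swa;
}

int main() {
    // With n_pattern = 4 over 8 layers:
    //   swa_pattern:             SWA SWA SWA dense SWA SWA SWA dense
    //   dense_start_swa_pattern: dense SWA SWA SWA dense SWA SWA SWA
    for (bool is_swa : dense_start_swa_pattern(8, 4)) {
        printf("%s ", is_swa ? "SWA" : "dense");
    }
    printf("\n");
    return 0;
}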

Testing

Clone the model from https://huggingface.co/PowerInfer/SmallThinker-4BA0.6B-Instruct and use convert-hf-to-gguf.py to convert it to GGUF format.
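
For example, the conversion step might look like this (the model path and the output filename are placeholders, and the --outfile/--outtype flags are assumed from the script's usual usage):

python convert-hf-to-gguf.py /path/to/SmallThinker-4BA0.6B-Instruct --outfile smallthinker-4b.gguf --outtype f16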

./build/bin/llama-cli -m /mnt/m2_4/wdl/smallthinker-4b.gguf -p "The meaning of life is" -n 64

The meaning of life is a profound and deeply personal question! There is no single answer, and different perspectives offer varying insights. Here are some major approaches to understanding it:

1. **Existential Meaning-Making**  
   Philosophers like Sartre argue life has no inherent meaning—*we create our own purpose

Full output:
./build/bin/llama-cli -m /mnt/m2_4/wdl/smallthinker-4b.gguf -p "The meaning of life is" -n 64
build: 6006 (92b518b4) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 31 key-value pairs and 323 tensors from /mnt/m2_4/wdl/smallthinker-4b.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = smallthinker
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = 4b_v7
llama_model_loader: - kv   3:                           general.finetune str              = 4b_v7
llama_model_loader: - kv   4:                         general.size_label str              = 32x758M
llama_model_loader: - kv   5:                   smallthinker.block_count u32              = 32
llama_model_loader: - kv   6:                smallthinker.context_length u32              = 32768
llama_model_loader: - kv   7:              smallthinker.embedding_length u32              = 1536
llama_model_loader: - kv   8:          smallthinker.attention.head_count u32              = 12
llama_model_loader: - kv   9:       smallthinker.attention.head_count_kv u32              = 2
llama_model_loader: - kv  10:                smallthinker.rope.freq_base f32              = 1500000.000000
llama_model_loader: - kv  11: smallthinker.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:          smallthinker.attention.key_length u32              = 128
llama_model_loader: - kv  13:        smallthinker.attention.value_length u32              = 128
llama_model_loader: - kv  14:                          general.file_type u32              = 1
llama_model_loader: - kv  15:                  smallthinker.expert_count u32              = 32
llama_model_loader: - kv  16:             smallthinker.expert_used_count u32              = 4
llama_model_loader: - kv  17:    smallthinker.expert_feed_forward_length u32              = 768
llama_model_loader: - kv  18:           smallthinker.feed_forward_length u32              = 768
llama_model_loader: - kv  19:            smallthinker.expert_gating_func u32              = 2
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type  f16:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 7.95 GiB (16.01 BPW) 
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = smallthinker
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 1536
print_info: n_layer          = 32
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 768
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 4B
print_info: model params     = 4.27 B
print_info: general.name     = 4b_v7
print_info: n_ff_exp         = 768
print_info: expert_gating_func   = sigmoid
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  8144.63 MiB
............................................................................................
llama_context: constructing llama_context
llama_context: non-unified KV cache requires ggml_set_rows() - forcing unified KV cache
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = true
llama_context: freq_base     = 1500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:        CPU KV buffer size =   128.00 MiB
llama_kv_cache_unified: size =  128.00 MiB (  4096 cells,  32 layers,  1/ 1 seqs), K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:        CPU compute buffer size =   299.75 MiB
llama_context: graph nodes  = 1799
llama_context: graph splits = 1
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
*** User-specified prompt will pre-start conversation, did you mean to set --system-prompt (-sys) instead?
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant


system_info: n_threads = 8 (n_threads_batch = 8) / 32 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: interactive mode on.
sampler seed: 32068148
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 64, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

user
The meaning of life is
assistant
The meaning of life is a profound and deeply personal question! There is no single answer, and different perspectives offer varying insights. Here are some major approaches to understanding it:

1. **Existential Meaning-Making**  
   Philosophers like Sartre argue life has no inherent meaning—*we create our own purpose

@github-actions github-actions bot added the python python script changes label Jul 27, 2025
@wdl339 wdl339 marked this pull request as ready for review July 27, 2025 17:38
@@ -938,6 +938,100 @@ ggml_tensor * llm_graph_context::build_moe_ffn(
return moe_out;
}

ggml_tensor * llm_graph_context::build_moe_ffn_from_probs(
Collaborator

The code duplication is unfortunate; is it possible to merge this into build_moe_ffn with probs as a toggle, without making too much of a mess?

Can be a follow-up.

Author

@wdl339 wdl339 Jul 28, 2025

That's a great point. I've been thinking about the best way to merge these and have a couple of ideas on how we could approach it.

  1. As you suggested, we could modify build_moe_ffn to accept an optional probs parameter. The main difficulty here is that the logic for weight normalization and activation functions diverges significantly between the two paths, so it would require some careful internal branching to keep it clean.
  2. Alternatively, we could extract the initial router logic (logits and probs calculation) into its own function. build_moe_ffn would then have a check at the beginning to decide whether to call this new router function. My main concern with this approach is that build_moe_ffn is a core function, and I'm a bit worried about affecting other models, so this would need careful testing.

Both approaches seem feasible. Given the complexity and your suggestion that this can be a follow-up, would you prefer I handle this in a separate PR, or should I proceed with one of these solutions here?

Collaborator

A separate PR is probably best.

wdl339 and others added 5 commits July 28, 2025 08:37