-
I was able to finish testing over the weekend and got some interesting results, which I've summarized in the table below. TL;DR: a combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful for generating custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.

Test results
Naive quantization: running the program with default options. For example: `llama-quantize --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
I have tried to keep to the spirit of the naming convention when generating my own models, but the proposed PR does provide the user with a high degree of freedom, and I think better quality with better compression ratios is possible. I'll upload the test models and results to my HF repo in the next few days, but for completeness, below are the combinations I'm using. A double arrow means the quant has been applied to the whole tensor, either at a higher (↑↑) or lower (↓↓) level than what the naive process would do, whilst a single arrow (↑ / ↓) signifies that only some layers have been quantized at the given level.

Tensor-wise quantization: quantizing selected whole tensors at a given level. For example: `llama-quantize --output-tensor-type q3_k --token-embedding-type q3_k --tensor-type ffn_down=q4_1 --tensor-type attn_v=q5_1 --tensor-type attn_q=q3_k --tensor-type attn_k=q3_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
Layer-wise quantization: quantizing selected tensors and layers at a given level. For example: `llama-quantize --token-embedding-type q3_k --output-tensor-type q4_k --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_k=q3_k" --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_q=q3_k" --tensor-type attn_v=q5_k --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_v=q4_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_gate=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_up=q3_k" --tensor-type ffn_down=q5_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
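The layer-selection regex is the least obvious part of the command above, so here is a quick sanity-check sketch (my own illustration, not part of llama-quantize; it assumes GGUF-style tensor names of the form blk.N.attn_k.weight and a hypothetical 32-layer model) showing which layers the attn_k pattern would match:

```sh
# Emit the attn_k tensor names of a hypothetical 32-layer model and filter them
# with the same pattern passed to --tensor-type above; only the matching layers
# (13, 14, 16, 18, 19 and 20-30) would be quantized to q3_k.
for i in $(seq 0 31); do echo "blk.$i.attn_k.weight"; done \
  | grep -E '\.(1[34689]|2[0-9]|30)\.attn_k'
```

The same check works for any of the other patterns; swap attn_k for attn_q, ffn_gate, ffn_up, and so on.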
-
I got a better-quality mix using stats from llama-imatrix. Test results and parameters are below, and I'm uploading the models to HF.

Test results
Layer-wise quantization: quantizing selected tensors and layers at a given level. For example: `llama-quantize --token-embedding-type q3_k --output-tensor-type q4_k --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_k=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_q=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_v=q4_k" --tensor-type attn_v=q5_k --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_gate=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_up=q3_k" --tensor-type ffn_down=q5_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
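For context, the matrix.dat file passed to --imatrix in all of the commands above would typically be produced with llama.cpp's imatrix tool, roughly as follows (a sketch; calibration.txt is a placeholder for whatever calibration text is used):

```sh
# Generate an importance matrix from a calibration text file; the resulting
# matrix.dat is what llama-quantize consumes via --imatrix.
./llama-imatrix -m Model-27B-F32.gguf -f calibration.txt -o matrix.dat
```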
-
@ddh0 what are the commands you used to produce these statistics, like the mean-squared deviation? @EAddario are your own statistics purely PPL, and if so, against what corpus did you run it?
-
@bartowski1182 for PPL I'm using the standard wikitext-2-raw-v1. The full set of logits, imatrices, tests, and results is available in my HF repo (i.e. DeepSeek-R1-Distill-Qwen-7B-GGUF & DeepSeek-R1-Distill-Llama-8B-GGUF). Since I have your attention 🙂, just wanted to thank you for all your contributions! I'm a fan of your work!
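For anyone wanting to reproduce the PPL side of these numbers, the measurement can be run with llama.cpp's perplexity tool along these lines (a sketch; the file names are placeholders rather than the exact files used above):

```sh
# Compute perplexity of a quantized model over the wikitext-2-raw-v1 test split,
# where wiki.test.raw is the raw test file from that dataset.
./llama-perplexity -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -f wiki.test.raw
```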
-
llama.cpp PR #12511 allows users to control the quantization level based on tensor type, as further detailed in this blog post. The author includes a table of his personal quantization scheme, saying:
To that end I've done some testing of my own, and my full results are available in this gist. Here's the TL;DR:
Mean-squared deviation compared to BF16, averaged over 10 inputs (lower is better):
In short, we can see that aggressive quantization of the FFN tensors causes the greatest deviation from BF16, and aggressive quantization of the token embeddings causes the least deviation. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to any combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0 depending on your situation.
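For clarity on the metric itself, a minimal formulation of the mean-squared deviation, assuming (my reading, not stated explicitly above) it is computed element-wise over the model's output logits against the BF16 reference and then averaged over the 10 inputs:

$$\mathrm{MSD} = \frac{1}{10}\sum_{i=1}^{10} \frac{1}{N_i}\sum_{j=1}^{N_i}\left(z^{\mathrm{BF16}}_{i,j} - z^{\mathrm{quant}}_{i,j}\right)^2$$

where $z_{i,j}$ is the $j$-th output logit for input $i$ and $N_i$ is the number of logits produced for that input.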