-
I was able to finish testing over the weekend and got some interesting results, which I've summarized in the table below. TL;DR: a combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful for generating custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model.

Test results
Naive quantization: running the program with default options. For example: `llama-quantize --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
I have tried to keep to the spirit of the naming convention when generating my own models, but the proposed PR does provide the user with a high degree of freedom, and I think better quality with better compression ratios is possible. I'll upload the test models and results to my HF repo in the next few days, but for completeness, below are the combinations I'm using. A double arrow means the quant has been applied to the whole tensor, either at a higher (↑↑) or lower (↓↓) level than what the naive process would do, whilst a single arrow (↑ / ↓) signifies that only some layers have been quantized at the given level.

Tensor-wise quantization: quantizing selected whole tensors at a given level. For example: `llama-quantize --output-tensor-type q3_k --token-embedding-type q3_k --tensor-type ffn_down=q4_1 --tensor-type attn_v=q5_1 --tensor-type attn_q=q3_k --tensor-type attn_k=q3_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
Layer-wise quantization: quantizing selected tensors and layers at a given level. For example: `llama-quantize --token-embedding-type q3_k --output-tensor-type q4_k --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_k=q3_k" --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_q=q3_k" --tensor-type attn_v=q5_k --tensor-type "\.(1[34689]|2[0-9]|30)\.attn_v=q4_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_gate=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_up=q3_k" --tensor-type ffn_down=q5_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
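The layer-selection regex is the least obvious part of the command above, so here is a quick sanity-check sketch (my own illustration, not part of llama-quantize; it assumes GGUF-style tensor names of the form blk.N.attn_k.weight and a hypothetical 32-layer model) showing which layers the attn_k pattern would match:

```sh
# Emit the attn_k tensor names of a hypothetical 32-layer model and filter them
# with the same pattern passed to --tensor-type above; only the matching layers
# (13, 14, 16, 18, 19 and 20-30) would be quantized to q3_k.
for i in $(seq 0 31); do echo "blk.$i.attn_k.weight"; done \
  | grep -E '\.(1[34689]|2[0-9]|30)\.attn_k'
```

The same check works for any of the other patterns; swap attn_k for attn_q, ffn_gate, ffn_up, and so on.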
-
I got a better-quality mix using stats from llama-imatrix. Test results and parameters are below, and I'm uploading the models to HF.

Test results
Layer-wise quantization: quantizing selected tensors and layers at a given level. For example: `llama-quantize --token-embedding-type q3_k --output-tensor-type q4_k --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_k=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_q=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.attn_v=q4_k" --tensor-type attn_v=q5_k --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_gate=q3_k" --tensor-type "\.(1[6-9]|2[0-9]|30|31)\.ffn_up=q3_k" --tensor-type ffn_down=q5_k --imatrix matrix.dat Model-27B-F32.gguf Model-27B-Q4_K_M.gguf Q4_K_M`
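For context, the matrix.dat file passed to --imatrix in all of the commands above would typically be produced with llama.cpp's imatrix tool, roughly as follows (a sketch; calibration.txt is a placeholder for whatever calibration text is used):

```sh
# Generate an importance matrix from a calibration text file; the resulting
# matrix.dat is what llama-quantize consumes via --imatrix.
./llama-imatrix -m Model-27B-F32.gguf -f calibration.txt -o matrix.dat
```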
-
@ddh0 what are the commands you used to produce these statistics, like the mean-squared deviation? @EAddario are your own statistics purely PPL, and if so, against what corpus did you run it?
-
@bartowski1182 for PPL I'm using the standard wikitext-2-raw-v1. The full set of logits, imatrices, tests, and results is available in my HF repo (i.e. DeepSeek-R1-Distill-Qwen-7B-GGUF & DeepSeek-R1-Distill-Llama-8B-GGUF). Since I have your attention 🙂, just wanted to thank you for all your contributions! I'm a fan of your work!
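For anyone wanting to reproduce the PPL side of these numbers, the measurement can be run with llama.cpp's perplexity tool along these lines (a sketch; the file names are placeholders rather than the exact files used above):

```sh
# Compute perplexity of a quantized model over the wikitext-2-raw-v1 test split,
# where wiki.test.raw is the raw test file from that dataset.
./llama-perplexity -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf -f wiki.test.raw
```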
-
llama.cpp PR #12511 allows users to control the quantization level based on tensor type, as further detailed in this blog post. The author includes a table of his personal quantization scheme, saying:
To that end I've done some testing of my own, and my full results are available in this gist. Here's the TL;DR:
Mean-squared deviation compared to BF16, averaged over 10 inputs (lower is better):
In short, we can see that aggressive quantization of the FFN tensors causes the greatest deviation from BF16, and aggressive quantization of the token embeddings causes the least deviation. Note that deviations greater than ~0.1 start to have a noticeable effect on the quality of the model's output. Realistically, it's probably wise to stick to any combination of Q3_K, Q4_K, Q5_K, Q6_K, and Q8_0 depending on your situation.
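For clarity on the metric itself, a minimal formulation of the mean-squared deviation, assuming (my reading, not stated explicitly above) it is computed element-wise over the model's output logits against the BF16 reference and then averaged over the 10 inputs:

$$\mathrm{MSD} = \frac{1}{10}\sum_{i=1}^{10} \frac{1}{N_i}\sum_{j=1}^{N_i}\left(z^{\mathrm{BF16}}_{i,j} - z^{\mathrm{quant}}_{i,j}\right)^2$$

where $z_{i,j}$ is the $j$-th output logit for input $i$ and $N_i$ is the number of logits produced for that input.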