DeepSeek V2/V3 MLA implementation #12801
Also, in case anybody wants to use this with the ..., you can work around it by compiling with ...
The failures don't seem to be anything to do with me: ...
Two questions: Will it automatically use the new MLA during conversion going forward, or do I need to enable a specific option? Is this ready to start doing conversion/imatrix calculation, or would I be wasting my time?
It will just use MLA all the time now - the backward compatibility is only for old files (and the only way to use non-MLA now would be to run a version of ...).
It depends on whether it's accepted, but the way of converting it to MQA inside of ...
It's not quite finished fine-tuning, but I should have some exciting news on the tiny draft models soon too. [Plot comparing draft models: the magenta line vs. the grey line (8-headed ...).]
I forgot to turn flash attention off in the PR, so added that now:

```cpp
if (params.flash_attn && model->arch == LLM_ARCH_DEEPSEEK2) {
    LLAMA_LOG_WARN("%s: flash_attn is not compatible with Deepseek2 - forcing off\n", __func__);
    params.flash_attn = false;
}
```

I don't think ... By forcing it off like this, you should still be able to use ...
https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0
https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF

I'm still waiting for somebody to show me a printout of the token IDs for the Unsloth quants, as apparently they changed the ...
@ggerganov I've noticed that even with this added:

```cpp
if (params.flash_attn && model->arch == LLM_ARCH_DEEPSEEK2) {
    LLAMA_LOG_WARN("%s: flash_attn is not compatible with Deepseek2 - forcing off\n", __func__);
    params.flash_attn = false;
}
```

when using a draft model ..., and I think somewhere in ... This isn't a problem with my PR, but it probably should be looked at in the future (along with any other unwanted global parameters the non-draft model might be picking up).
I found that the FA code update in b4759 (CUDA implementation) caused unexpected drastic performance changes in my test cases. This might be related.
I've got the draft models trained for ...:

https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0
https://huggingface.co/jukofyork/DeepSeek-V3-0324-DRAFT-0.5B-v1.0-GGUF

I used the same mix of data as for ...

```bash
#!/bin/bash

host_address=192.168.1.2
port_number=8080

# Store the original directory
ORIGINAL_DIR=$(pwd)

# Change to the target directory
cd ~/llama.cpp_MLA/llama.cpp/build/bin

# Turn off NUMA balancing
echo 0 | sudo tee /proc/sys/kernel/numa_balancing > /dev/null

# Ask for permission to drop caches
read -p "Do you want to drop caches? (y/n) " -n 1 -r
echo  # Move to a new line
if [[ $REPLY =~ ^[Yy]$ ]]
then
    echo "Dropping caches..."
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
fi

# Run the main command
./llama-server \
    --host "$host_address" \
    --port "$port_number" \
    --model ~/models/gguf/deepseek-v3-0324-mla-Q4_K_L+BF16.gguf \
    --alias "deepseek-v3-0324--Q4_K" \
    --chat-template deepseek3 \
    --n-gpu-layers 99 \
    --numa distribute \
    --override-tensor exps=CPU \
    --override-kv "deepseek2.expert_used_count=int:6" \
    --override-kv "deepseek2.expert_weights_scale=float:2.3" \
    --ctx_size 32768 \
    --batch-size 1024 \
    --ubatch-size 256 \
    --model-draft ~/models/gguf/draft_models/DeepSeek-V3-0324-DRAFT-0.5B-Q4_0.gguf \
    --top-k 1 \
    --samplers "top_k" \
    --gpu-layers-draft 99 \
    --draft-min 3 \
    --draft-max 32 \
    --draft-p-min 0.667

# Return to the original directory
cd "$ORIGINAL_DIR"
```

with the custom ... function:

```bash
safe_sed_function() {
    local file=$1
    local function_signature=$2
    local replacement=$3

    # Create backup
    cp "$file" "$file.bak"

    # Perform the replacement using address range and c command
    sed -i "${function_signature}/,/^}/c\\${replacement}" "$file"

    # Clean up
    rm "$file.bak"

    echo "Successfully replaced function in $file"
    echo "-------------------"
}

safe_sed_function "src/llama-quant.cpp" \
    "/^static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor \\* tensor, llama_ftype ftype) {" \
    "static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {\n\
    const std::string name = ggml_get_name(tensor);\n\
    if (name.find(\"_exps\") != std::string::npos) {\n\
        return GGML_TYPE_Q4_K;\n\
    } else if (name.find(\"attn_k_b\") != std::string::npos || name.find(\"attn_v_b\") != std::string::npos) {\n\
        return GGML_TYPE_BF16;\n\
    }\n\
    return GGML_TYPE_Q6_K;\n\
}"
```

I can generate over 11 tokens per second for refactoring tasks now on a machine with ..., and around 35-40 tokens per second prompt processing. See #11446 (comment) for an explanation of why I'm using ...

I've yet to really test how much quality is lost using 6 experts, the adjusted scale factor and ...
@ngxson @ggerganov @slaren This is ready for review! I can keep it alive for now because of the copy of ...
src/llama-model.cpp (Outdated)

```cpp
// TODO: the CUDA backend used to not support non-cont. RoPE, investigate removing this
q_pe = ggml_cont(ctx0, q_pe);
```
I think this was already investigated and the rope should work correctly now, so it is no longer needed to have `ggml_cont()` (see #12457 (comment)).
This version does not require changes to the KV cache implementation. Are there plans to update it in the future, or is this no longer needed?
```cpp
// See llm_build_deepseek2() for why attn_factor has to be scaled for YaRN RoPE to work correctly.
// See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
const float yarn_attn_factor_scaled = model.arch == LLM_ARCH_DEEPSEEK2 ? 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale)) : cparams.yarn_attn_factor;
```
Can this be absorbed into `cparams.yarn_attn_factor` and deduplicated from here and `llm_build_deepseek2`?
No, it turns out that if you convert the MLA stuff to being MQA right at the start, then all of the existing code for the KV-cache works without any changes. If you don't convert to MQA and try to keep the ability to run as MHA, it ends up a real mess and needs lots of changes.
I think part of my performance regression may be something to do with the changes made to fix the ..., as I now get slightly better generation speed using ...
Yeah, the back-ends could likely detect even more cases like this where you can "collapse" batches and/or "fill-in" inner dimensions with size 1, etc.
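For illustration, the kind of condition a backend could check (a minimal sketch with a simplified stand-in struct, not the actual ggml types):

```cpp
#include <cstdint>
#include <cstddef>

// Simplified stand-in for a ggml-style tensor: ne[] are dimension sizes,
// nb[] are the strides in bytes for each dimension.
struct tensor_view {
    int64_t ne[4];
    size_t  nb[4];
};

// A 3D "batched" matrix can be collapsed into a single 2D matrix when the
// batch dimension is laid out contiguously after the rows, i.e. stepping
// once along dim 2 is the same as stepping ne[1] times along dim 1.
static bool can_collapse_batch(const tensor_view & t) {
    return t.ne[2] == 1 || t.nb[2] == t.nb[1] * (size_t) t.ne[1];
}
```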
@fairydreaming Could you test if there is any difference between having those permutes but not doing the views in ...? Also, can you try changing ...? It would be interesting to see if this helps your CPU setup, as the data layout for that multiplication should be way more cache-friendly.
@jukofyork Generation is slightly better with permutes alone but only with short contexts. Initially it's about 0.7 t/s more for empty context but the difference quickly goes down to about 0.1 t/s at 8k. The difference in prompt processing performance is negligible.
This change very slightly reduced the generation performance (about 0.1 t/s for short context sizes).
Thanks! If the backends do want to add the ability to squash dimensions, then we'll need to change the permutation order to be like this, so it's good that it doesn't have any negative effects when not using the view optimisations.
Are you using a model with the ...?
@jukofyork It was a freshly converted one with this PR.
Ah, does your CPU support ...? For CUDA the ...
One issue with CUDA is that currently the support for non-contiguous tensors is much worse than with the CPU backend, particularly for quantized ... For batch sizes > 1 I'm not yet sure how to handle it; I could maybe adapt the MMQ code to handle the ...
Feedback for reference: DeepSeek R1 Q8_0

Configuration 1 (with MLA optimization): ...
Configuration 2 (original "PR" version with -ot ...): ...
I can confirm that using ... I haven't time currently to find the exact details though.
* Merged using squash to remove all noise commit messages
* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large
* Removed 3 conts (2x RoPE and 1x RMS-norm)
* Changed to use `<cmath>` instead of `<math.h>`
* Reverted removal of the 3 conts
* Used `reshape` in `llm_graph_context::build_attn_mha()`
* Use `k_pe = ggml_reshape`
* Removed the 3 conts again
* Removed the 3D views of `wk_b` and `wv_b`, and just save as 3D in GGUF
* Removed MQA optimisation from `build_attn_mha()` as no gains now
* Simplified `is_mla` branch in `llm_build_deepseek2()`
* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls
* Fixed call to `build_attn` in `llm_build_t5_enc`
Having a weird issue with Microsoft's R1 tune: I converted and quantized with this patch applied, then calculated the imatrix from the quantization, but now I can't apply the imatrix.

(Ignore the "tensor cols not divisible" part; the issue is that I get "imatrix size is different from tensor size".) Any idea what could be causing this? I assume it's from this change, but obviously I should probably roll back and verify if you think there's a chance it's unrelated.
The info in this seems relevant: ikawrakow/ik_llama.cpp#250
So it looks like there are 2 specific parts that could affect it. One is in imatrix.cpp, changing:

...

to:

...

No idea what src1->ne[2] could possibly be... The other is in llama.cpp (a section that's now in llama-quant.cpp) marked with ..., where we check if specifically the imatrix file is coming from an older conversion with standard attention - but that shouldn't be applicable to my case, since I converted it myself with the new attention and then calculated the imatrix. I suppose it's possible that that first line is the fix required, but I have no idea what it is :')
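(For context on `src1->ne[2]`: in ggml a tensor has up to four dimensions; for a mat-mul activation, `ne[0]` is the row width, `ne[1]` the number of rows/tokens, and `ne[2]`/`ne[3]` are batch dimensions, which the new MLA graph uses for its 3D multiplies. A minimal illustration with a simplified stand-in struct, not the actual `imatrix.cpp` code:)

```cpp
#include <cstdint>

// Simplified stand-in for ggml's dimension layout (not the real struct).
// For a mat-mul activation src1: ne[0] = row width, ne[1] = rows (tokens),
// ne[2], ne[3] = batch dimensions (both 1 for a plain 2D multiply).
struct dims { int64_t ne[4]; };

// Total number of activation rows, counting the batch dimensions too;
// accumulating over only ne[1] may under-count whenever ne[2] or ne[3] > 1.
static int64_t total_rows(const dims & d) {
    return d.ne[1] * d.ne[2] * d.ne[3];
}
```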
The particularly odd part is that I'm able to inference it without issue, only ...
I’m having severe degradation at long context since this commit. Using the newly quantized UD Unsloth quants (UD-IQ2_M and UD-Q2_K_XL). Quality is ok with short inputs but when I work with my standard 6-7K prompt I get garbled Chinese. This was not the case with the old pre-MLA UD-Q2_K_XL. Running with Q8_0 k cache oddly seems to resolve it, although quality is still not as good as pre-MLA.
I get the same issue as @MB7979, and it seems to be reproducible on different setups. I have a Ryzen 7 7800X3D + 192GB RAM + 128GB VRAM (5090+4090x2+A6000), and I get gibberish normally at longer ctx, and for me -ctx q8_0 doesn't solve it. https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2
@jukofyork Hey! Just wanted to ask if you know why CPU + GPU offloading with the MLA commit gives gibberish? See https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2#680f679eb63b7f85d975d8c5 and many other questions specifically on the MLA commit - I converted both R1 and V3 with the new MLA commit.

I already: ...

Thanks, and I'd really appreciate it if you could investigate this, especially since R2 might be around the corner! See https://huggingface.co/unsloth/DeepSeek-R1-GGUF-UD, https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD and https://huggingface.co/unsloth/MAI-DS-R1-GGUF
I'm away for another week so can't easily check this atm, but have you tried again since this PR got merged earlier today: ...? This might fix it.
I think I misdiagnosed the problem in that PR; I'll make a PR for a better fix soon. But only FP16 models should be affected in the first place.
Yes, that commit has not resolved the issue when using partial offload with the DeepSeek quants of R1 and V3 I’m using, unfortunately.
No luck here either on the latest commit.
It seems the gibberish issue when offloading non-experts was fixed a few days ago, in case anyone was wondering: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2#68192917c3d212ad5b33964d @MB7979 may confirm.
Yes, can confirm. Gibberish seems to be completely resolved now. Thanks to @JohannesGaessler for sorting that.
This should hopefully be my final PR for this.

What it does:

* Adds the new `n_embd_head_k_mla`/`n_embd_head_v_mla` metadata (see below).
* A change to `llm_graph_context::build_attn_mha()`, which avoids the extra overhead of 3D batched matrix multiplication and just converts it into normal 2D matrix multiplication (this is my only real contribution to this - the rest is all @fairydreaming's work!). See the sketch after this list.
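As a rough sketch of that 3D-to-2D idea (placeholder tensor names, ggml-style graph code assuming a single shared K head; not the PR's actual implementation):

```cpp
// q is [head_dim, n_tokens, n_head] and k is [head_dim, n_kv] (one shared head).
// Instead of broadcasting k over a 3D batched mul_mat, flatten the head and
// token dimensions of q (assumes q is contiguous) and do a single 2D mul_mat.
ggml_tensor * q_2d = ggml_reshape_2d(ctx0, q, q->ne[0], q->ne[1]*q->ne[2]);
ggml_tensor * kq   = ggml_mul_mat(ctx0, k, q_2d);   // -> [n_kv, n_tokens*n_head]
```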
How it works:

1. Inside of `convert_hf_to_gguf.py` we alter the metadata to make the new MLA GGUF files appear to be MQA, so for all intents and purposes the rest of `llama.cpp` will now see these new GGUF files as being MQA. We also add two new bits of metadata that we need to be able to "decompress" MQA back into MHA at the end, and these hold the original `k_head_dim` and `v_head_dim` of the model.

2. We add an extra tensor called `v_mla` to the parameters of `llm_graph_context::build_attn_mha()`. This gets used to "decompress" MQA back into MHA (see the sketch below). This, and the function signature of `llm_graph_context::build_attn()` used to call this, are the only real changes needed to `llama.cpp`'s code outside of the `deepseek2`-specific stuff in `llama-model.cpp`. I think this is the cleanest / most-maintainable way to add this MLA support.
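A sketch of what that "decompression" step can look like (placeholder names and shapes - e.g. `kqv_compressed` - rather than the exact code in the diff):

```cpp
// kqv_compressed holds the attention output in the shared latent space:
//   [kv_lora_rank, n_tokens, n_head]
// v_mla holds the per-head up-projection: [kv_lora_rank, v_head_dim, n_head]
// One extra batched mul_mat projects the latent values back up to the full
// per-head value dimension, i.e. "decompresses" MQA back into MHA.
if (v_mla != nullptr) {
    kqv = ggml_mul_mat(ctx0, v_mla, kqv_compressed);   // -> [v_head_dim, n_tokens, n_head]
}
```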
3. When loading the tensors we will load only the legacy `wkv_b` tensor or the new split `wk_b` and `wv_b` tensors.

4. Inside of `llm_build_deepseek2()` we can treat it as MHA, as the old code did, or as MQA, as the new code does.
5. To get context-shifting to work:

   A. Ensure the `RoPE` part goes first and the `NoPE` part goes second:

   ```cpp
   ggml_tensor * q_states = ggml_concat(ctx0, q_pe, q_nope_absorbed, 0);
   ```

   and:

   ```cpp
   ggml_tensor * q_states = ggml_concat(ctx0, q_pe, q_nope, 0);
   ```

   B. Apply the same scaling to `yarn_attn_factor` inside of `llama_context::build_rope_shift()` as is applied inside of `llm_build_deepseek2()`.
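For reference, the scaled factor is the same expression quoted in the review snippet earlier in this thread:

```cpp
// See llm_build_deepseek2() for why attn_factor has to be scaled for YaRN RoPE to work correctly.
// See https://github.com/ggerganov/llama.cpp/discussions/7416 for detailed explanation.
const float yarn_attn_factor_scaled = model.arch == LLM_ARCH_DEEPSEEK2
        ? 1.0f / (1.0f + 0.1f * logf(1.0f / freq_scale))
        : cparams.yarn_attn_factor;
```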
NOTES:

* Unless you force the `attn_k_b` and `attn_v_b` tensors to use `BF16` in `llama_tensor_get_type()`, you'll likely get pretty horrible CUDA performance. It seems that CUDA just does not like the 3D matrix multiplies these need when they are quantised.
* If you do use `BF16` for these, then it is also worth patching `ggml/src/ggml-cuda/ggml-cuda.cu` to use `<= MMV_MAX_ROWS` instead of `< MMV_MAX_ROWS`, or just setting `MMV_MAX_ROWS = 513`. This gains me quite a bit by using the optimised MMV branch instead of the general `CuBLAS` branch.
* I tried adding a `-mla` option like `ik_llama.cpp` uses, but it just ended up a real mess (see previous PR).
* I also tried keeping the `wkv_b` tensor and just slicing it to get `wv_b`, but it wasn't aligned properly and would have needed a copy/cont every time we accessed it, making keeping it pointless...

@ngxson Please don't merge this yet as I have left a placeholder function that needs to be removed first: ...
I don't want to make the changes to all the calls to `build_attn()` in `llama-model.cpp` until we're happy with the code, or else it could end up a nightmare having to rebase it if others add new models whilst waiting for this to be reviewed!

I've tested this so far on:
* `deepseek-v2-lite`: both legacy GGUF files and new GGUF files, quantized and in `BF16`.
* `deepseek-r1`: only the new GGUF files, quantized and in `BF16`.