DeepSeek V2/V3 with -mla option #12725
Conversation
I've tidied things up to use 2D views here:

// for MQA (ie: GQA with 1 group) we don't need to use a batched matrix multiply
if (n_head_kv == 1) {
q = ggml_view_2d(ctx0, q,
n_embd, n_tokens*n_head,
ggml_row_size(q->type, n_embd),
0);
}
ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
// note: this op tends to require high floating point range
// while for some models F16 is enough, for others it is not, so we default to F32 here
ggml_mul_mat_set_prec(kq, GGML_PREC_F32);
if (n_head_kv == 1) {
kq = ggml_view_3d(ctx0, kq,
n_kv, n_tokens, n_head,
ggml_row_size(kq->type, n_kv),
ggml_row_size(kq->type, n_kv)*n_tokens,
0);
}
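As an aside on why the 2D view is legal here: with a single KV head, every (token, head) row of q gets multiplied by the same k, so the head dimension can simply be folded into the token dimension. A standalone sketch of that equivalence (plain C++ with made-up sizes, not ggml code):

#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main() {
    const int n_embd = 8, n_kv = 5, n_tokens = 3, n_head = 4;

    std::vector<float> q(n_head * n_tokens * n_embd); // [n_embd, n_tokens, n_head], n_embd fastest
    std::vector<float> k(n_kv * n_embd);               // [n_embd, n_kv], shared by every head (MQA)
    for (auto & x : q) x = (float) rand() / RAND_MAX;
    for (auto & x : k) x = (float) rand() / RAND_MAX;

    // batched path: one [n_kv x n_tokens] product per head
    std::vector<float> kq_batched(n_head * n_tokens * n_kv);
    for (int h = 0; h < n_head; ++h)
        for (int t = 0; t < n_tokens; ++t)
            for (int c = 0; c < n_kv; ++c) {
                float acc = 0.0f;
                for (int e = 0; e < n_embd; ++e)
                    acc += k[c*n_embd + e] * q[(h*n_tokens + t)*n_embd + e];
                kq_batched[(h*n_tokens + t)*n_kv + c] = acc;
            }

    // flattened path: a single [n_kv x (n_tokens*n_head)] product, r fuses (head, token)
    std::vector<float> kq_flat(n_head * n_tokens * n_kv);
    for (int r = 0; r < n_tokens * n_head; ++r)
        for (int c = 0; c < n_kv; ++c) {
            float acc = 0.0f;
            for (int e = 0; e < n_embd; ++e)
                acc += k[c*n_embd + e] * q[r*n_embd + e];
            kq_flat[r*n_kv + c] = acc;
        }

    // same values in the same memory order, so the views cost nothing
    assert(kq_batched == kq_flat);
    printf("batched and flattened kq are identical\n");
    return 0;
}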
I don't think I can really improve things much more, and the ugliness in …
Can't we just derive these from …?
From a user perspective, this can be a bad experience if the user updates llama.cpp to the latest version and suddenly the model no longer works. I think what we can do is to add a new gguf metadata key like …
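For what it's worth, checking such a key at load time would be cheap; a minimal sketch using ggml's gguf C API (the header location and the key name "deepseek2.attention.use_mla" are assumptions made up here for illustration, not something defined by this PR):

#include "gguf.h" // assumed location of ggml's gguf C API

#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to read %s\n", argv[1]);
        return 1;
    }

    // default to the old behaviour when the key is absent, so existing GGUFs keep working
    bool use_mla = false;
    const int64_t key_id = gguf_find_key(ctx, "deepseek2.attention.use_mla");
    if (key_id >= 0) {
        use_mla = gguf_get_val_bool(ctx, key_id);
    }

    printf("use_mla = %s\n", use_mla ? "true" : "false");
    gguf_free(ctx);
    return 0;
}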
I've found a subtle bug which only happens when you use a speculative decoding model:

for (int i = 0; i < n_layer; i++) {
int64_t n_embd_k;
int64_t n_embd_v;
// note: deepseek with MLA option converts into MQA (ie: GQA with 1 group)
if (cparams.mla_attn) {
n_embd_k = hparams.n_lora_kv + hparams.n_rot;
n_embd_v = hparams.n_lora_kv;
} else {
n_embd_k = hparams.n_embd_k_gqa(i) + hparams.n_embd_k_s();
n_embd_v = hparams.n_embd_v_gqa(i) + hparams.n_embd_v_s();
}
const char * dev_name = "CPU";
ggml_backend_buffer_type_t buft;
if (offload) {
auto * dev = model.dev_layer(i);
buft = ggml_backend_dev_buffer_type(dev);
dev_name = ggml_backend_dev_name(dev);
} else {
buft = ggml_backend_cpu_buffer_type();
}
LLAMA_LOG_DEBUG("%s: layer %3d: n_embd_k = %" PRId64 ", n_embd_v = %" PRId64 ", dev = %s\n", __func__,
i, n_embd_k, n_embd_v, dev_name);
ggml_context * ctx = ctx_for_buft(buft);
if (!ctx) {
LLAMA_LOG_ERROR("%s: failed to create ggml context for kv cache\n", __func__);
return false;
}
ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, n_embd_k*kv_size);
ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, n_embd_v*kv_size);
ggml_format_name(k, "cache_k_l%d", i);
ggml_format_name(v, "cache_v_l%d", i);
k_l.push_back(k);
v_l.push_back(v);
}

The problem is that the MLA branch above is still taken for the draft model, whose architecture isn't DeepSeek2. This logic:

if (params.mla_attn && model->arch != LLM_ARCH_DEEPSEEK2) {
LLAMA_LOG_WARN("%s: mla_attn is only compatible with Deepseek2 - forcing off\n", __func__);
params.mla_attn = false;
}
if (params.flash_attn && params.mla_attn) {
LLAMA_LOG_WARN("%s: flash_attn is not compatible with mla_attn - forcing off\n", __func__);
params.flash_attn = false;
    params.flash_attn = false;
}

doesn't seem to get applied to the draft model (at least from what is printed). The solution for this case is just to check for LLM_ARCH_DEEPSEEK2 as well:

if (cparams.mla_attn && model.arch == LLM_ARCH_DEEPSEEK2) {
n_embd_k = hparams.n_lora_kv + hparams.n_rot;
n_embd_v = hparams.n_lora_kv;
} else {
n_embd_k = hparams.n_embd_k_gqa(i) + hparams.n_embd_k_s();
n_embd_v = hparams.n_embd_v_gqa(i) + hparams.n_embd_v_s();
}

But I also noticed that the KV cache's v_trans setting and padding also depend on cparams.flash_attn:

v_trans = !recurrent && !cparams.flash_attn;

uint32_t llama_kv_cache_unified::get_padding(const llama_cparams & cparams) const {
// the FA kernels require padding to avoid extra runtime boundary checks
return cparams.flash_attn ? 256u : 32u;
}

and I wonder if the same subtle bug is happening when using a flash-attention-allowed draft model with a non-allowed main model like LLM_ARCH_GROK? I think this check:

if (params.flash_attn && model->arch == LLM_ARCH_GROK) {
LLAMA_LOG_WARN("%s: flash_attn is not compatible with Grok - forcing off\n", __func__);
params.flash_attn = false;
might need investigating to check that setting it there actually takes effect for the draft model. This was an extremely subtle bug to track down (I thought it was my speculative model that was broken until I ran it on its own and using …).
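For a sense of what the two sizing branches work out to per token and per layer, here is a standalone arithmetic sketch; the hyperparameters are assumed DeepSeek-V2-Lite-like values for illustration only, not taken from this PR:

#include <cstdint>
#include <cstdio>

int main() {
    // assumed DeepSeek-V2-Lite-style values (for illustration)
    const int64_t n_lora_kv     = 512; // kv_lora_rank
    const int64_t n_rot         = 64;  // qk_rope_head_dim
    const int64_t n_head_kv     = 16;  // DeepSeek2 uses no GQA grouping in the non-MLA path
    const int64_t n_embd_head_k = 192; // qk_nope_head_dim + qk_rope_head_dim
    const int64_t n_embd_head_v = 128; // v_head_dim

    // MLA branch: cache the compressed latent (plus the RoPE part for K)
    const int64_t mla_k = n_lora_kv + n_rot;
    const int64_t mla_v = n_lora_kv;

    // non-MLA branch: full per-head K/V, i.e. the usual GQA sizing
    const int64_t gqa_k = n_embd_head_k * n_head_kv;
    const int64_t gqa_v = n_embd_head_v * n_head_kv;

    printf("MLA     : n_embd_k = %lld, n_embd_v = %lld\n", (long long) mla_k, (long long) mla_v);
    printf("non-MLA : n_embd_k = %lld, n_embd_v = %lld\n", (long long) gqa_k, (long long) gqa_v);
    return 0;
}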
Not sure what you mean? :) We can't have it in the file, as I don't really want to have some complex code that tries to guess the allowed operations for a given back-end to decide on the slicing type either; I think this will be way too brittle to be effective. Whatever the final solution of the PR is, I can say for sure I will be patching it to use …
I meant you can most likely quantize … There are also some incoming updates on that for both CUDA and Vulkan. :)
I agree that no one will want to use the non-MLA version, but the problem is that some users don't know what MLA is, and all they care about is that updating to a newer version of llama.cpp should not break their existing model.

Indeed, I'm thinking about another approach that is a bit stupidly simple: the current arch name is …

Re. having …: many archs are using this trick to detect whether certain bias tensors are present or not, so we can just do the same!
@fairydreaming suggested this when he first put forward his MLA PR, but @ggerganov wasn't keen on the idea. I agree it is an appealing solution that saves a lot of hassle though?
Yeah, this was exactly my first idea and I originally added this (convert_hf_to_gguf.py, line 333 in b4c169f), but reverted it when I found even …
Interesting!? I think there is a lot of potential for using …

I have an RTX PRO 6000 Blackwell on order, so hope to retry some of the stuff with CUDA and …
I'm not gonna rush to make any changes to the sliced tensors yet anyway, and hopefully we can get more feedback on what is the best direction to go (just praying we don't have another refactoring like last time, as the …).
Ok sorry, I was missing the context; thanks for pointing me to the correct discussion. Both @fairydreaming and Georgi (not pinging you here to reduce a bit of noise) make good points regarding the fact that MLA is not something "for free" but can affect the performance and memory usage in a way that we don't yet know. So adding …

But I still don't want to break existing quants. As deepseek v3 quants are huge, not everyone can requant them. I imagine even @bartowski1182 won't want this to happen, so let's try not to have a breaking change I guess?
It would certainly be nice if we could avoid breaking changes, but I'm of the opinion that if progress necessitates breakage, let's break it.

Koboldcpp does work to maintain (too much) backwards compatibility for those that need it, and people can run an older llama.cpp while waiting for an update as well.

By all means I would vastly prefer we keep it compatible, but I won't cry if quants have to be remade so that we can avoid future maintenance headaches.
{LLM_TENSOR_DEC_ATTN_Q,        {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_DEC_ATTN_K,        {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_Q,            {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_K,            {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_V,            {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_QKV,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_OUT,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_GATE,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_DOWN,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_UP,            {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_DOWN_SHEXP,    {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_GATE_SHEXP,    {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_FFN_UP_SHEXP,      {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_Q_A,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_Q_B,          {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_KV_A_MQA,     {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_ATTN_KV_B,         {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
I think these were deleted inadvertently? For example, ffn_*_shexp are still used by qwen moe.
I think these were all accidentally duplicated in the main branch so I removed the duplicates when inserting the new ones.
I'll try and explore some options over the weekend to allow for backwards compatibility. If we're keeping the …

We can leave the actual final decision on what to do about the splitting, duplicates, …
convert_hf_to_gguf.py
return [
    (self.map_tensor_name(name), data_torch),
    (self.map_tensor_name(name_kb), k_b),
    (self.map_tensor_name(name_vb), v_b)
]
Please tell me if I missed something regarding the discussion about whether or not to duplicate these tensors (slices).

On the subject of not duplicating them, I'm thinking about an idea that could allow slicing kv_b_proj at load time without using too much memory: do something like this:
model.wk_b = nullptr; // at earlier load stage, we don't have this
// then during warmup
model.wk_b = ggml_view_2d(wkv_b, qk_nope_head_dim,...);
model.wv_b = ggml_view_2d(wkv_b, v_head_dim,...);
// transpose wk_b then copy back to initial memory
ggml_tensor * wk_b_T = ggml_cont(ggml_transpose(model.wk_b));
model.wk_b = ggml_cpy(wk_b_T, model.wk_b); // not even sure if this would work
The ggml_cpy and ggml_view won't allocate new memory on the device buffer; only ggml_cont needs to allocate memory.
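The scratch-then-copy-back part of that idea can at least be sanity-checked outside ggml; a tiny plain C++ sketch (sizes made up, and it says nothing about whether the in-place ggml_cpy would be safe on a given backend, which is the open question above):

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const int rows = 2, cols = 3;                  // stand-ins for qk_nope_head_dim x kv_lora_rank
    std::vector<float> wk_b = {1, 2, 3, 4, 5, 6};  // row-major [rows x cols]

    // "ggml_cont(ggml_transpose(...))": materialise the transpose in scratch memory
    std::vector<float> scratch(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            scratch[c*rows + r] = wk_b[r*cols + c];

    // "ggml_cpy(wk_b_T, wk_b)": the transpose has the same byte size, so it can
    // overwrite the original allocation without any extra permanent memory
    std::memcpy(wk_b.data(), scratch.data(), scratch.size() * sizeof(float));

    for (float x : wk_b) printf("%.0f ", x);       // prints: 1 4 2 5 3 6
    printf("\n");
    return 0;
}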
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we could easily do this (or something similar) so long as we keep kv_proj_b as a float32, but this has different problems:

- We're now forcing those not using the -mla option to have kv_proj_b stored as float32 when they don't really need it and can just use the ggml_mul_mat_set_prec(xxx, GGML_PREC_F32) call instead.
- This won't work for existing quantised models as they are stored row-major in memory and we'd have to dequantise, requantise and hope the alignment is right (and it will have a row length of 128 so also won't work on any of the non-legacy quants; see the quick check below).
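A quick standalone check of the block-size constraint behind the second point, using the block sizes from ggml (32 elements for the legacy quants, 256-element super-blocks for the k-quants):

#include <cstdio>

int main() {
    const int legacy_block = 32;  // Q4_0, Q5_0, Q8_0, ...
    const int kquant_block = 256; // Q4_K, Q5_K, Q6_K, ... (QK_K)

    const int rows[] = { 128, 512 }; // e.g. n_embd_head_qk_nope vs kv_lora_rank
    for (int n : rows) {
        printf("row length %3d: legacy quants %-4s k-quants %s\n", n,
               n % legacy_block == 0 ? "ok," : "no,",
               n % kquant_block == 0 ? "ok" : "no");
    }
    return 0;
}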
This raises a good point though: I don't think we really need to save wv_b at all and can just use the upper slice of wkv_b (I think - will have to check tomorrow).
Setting this back to draft whilst I make the changes.
@ngxson (and others) What do you think of this now I've removed the extra wv_b tensor and just take a view of wkv_b instead?

// {n_embd_head_v, n_head, n_tokens}
ggml_tensor * wv_b = ggml_view_3d(ctx0, model.layers[il].wkv_b,
kv_lora_rank, n_embd_head_v, n_head,
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * (n_embd_head_qk_nope + n_embd_head_v),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * n_embd_head_qk_nope);
    cb(wv_b, "wv_b", il);

Then we add a …

I've tested this on …

You get this message if you try to use …

I'm not sure what the performance impact of using the … is yet. I'm quantising the full …
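For reference, the stride arithmetic the ggml_view_3d slice above relies on can be checked in isolation; a standalone sketch with assumed DeepSeek-V2-Lite-like dimensions and an f32 element size (a quantised wkv_b would change the byte counts):

#include <cstddef>
#include <cstdio>

int main() {
    // assumed dimensions, for illustration only
    const size_t kv_lora_rank        = 512;
    const size_t n_embd_head_qk_nope = 128;
    const size_t n_embd_head_v       = 128;
    const size_t n_head              = 16;
    const size_t elt                 = sizeof(float); // f32 assumption

    const size_t nb1    = kv_lora_rank * elt;                          // bytes per row
    const size_t nb2    = nb1 * (n_embd_head_qk_nope + n_embd_head_v); // each head owns qk_nope + v rows
    const size_t offset = nb1 * n_embd_head_qk_nope;                   // wv_b starts after the qk_nope rows

    printf("view ne = [%zu, %zu, %zu]\n", kv_lora_rank, n_embd_head_v, n_head);
    printf("nb1 = %zu bytes, nb2 = %zu bytes, offset = %zu bytes\n", nb1, nb2, offset);
    return 0;
}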
@jukofyork sorry to bother, but any chance you could try a small quant of V3? Like Q2_K_XL size or similar. I want to test CPU (192GB RAM) + 4 CUDA GPUs (128GB VRAM), since when I use -mla from the ik_llama.cpp fork I get gibberish output (ikawrakow/ik_llama.cpp#305, tested with a model quanted with ik_llama.cpp, DeepSeek-V3-0324-IQ2_K_R4). When not using the flag, it works fine.
DeepSeek-V2-Lite-Chat

These are all after I removed the extra wv_b tensor, with …
I'll try and test it tomorrow, but at least for the "lite" model above, …
This works, but gives truly horrible performance on the CUDA back-end with …
I'll make a branch with this alternative so others can compare.
It's here: https://github.com/jukofyork/llama.cpp/tree/mainline-llama-cpp-master--mla--f32 but this adds around …

I say we just live with the horrible CUDA performance for now and go with @JohannesGaessler's suggestion:
and when he gets time he can maybe figure out what is causing this for quantised batched matrix multiplies like …

I also can't prove the …

The newly created MLA-quants will also have the extra …

Opening this PR back up for review, as I don't think I can really do any better than this.
As a workaround for the CUDA performance, you should be able to adapt this script for your own personal quants:

#!/bin/bash
function safe_sed() {
local file=$1
local pattern=$2
local replacement=$3
# Check if pattern exists
if ! sed -n "s/${pattern}/${replacement}/p" "$file" | grep -q .; then
echo "Error: Pattern not found in $file: $pattern"
return 1
fi
# Create backup
cp "$file" "$file.bak"
# Perform the replacement
sed -i "s/${pattern}/${replacement}/g" "$file"
# Show diff
echo "Changes in $file:"
diff "$file.bak" "$file"
# Clean up
rm "$file.bak"
echo "Successfully replaced in $file"
echo "-------------------"
}
function safe_sed_function() {
local file=$1
local function_signature=$2
local replacement=$3
# Create backup
cp "$file" "$file.bak"
# Perform the replacement using address range and c command
sed -i "${function_signature}/,/^}/c\\${replacement}" "$file"
# Clean up
rm "$file.bak"
echo "Successfully replaced function in $file"
echo "-------------------"
}
rm -rf llama.cpp
git clone https://github.com/jukofyork/llama.cpp --branch mainline-llama-cpp-master--mla
cd llama.cpp
# For attn_v_b to use fast mmv call.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "< MMV_MAX_ROWS" "<= MMV_MAX_ROWS"
# Don't offload these huge tensors to the GPU, as the PCI-E transfer is slower than just using the CPU.
safe_sed "ggml/src/ggml-cuda/ggml-cuda.cu" "const int min_batch_size = 32" "const int min_batch_size = 9999999"
# Hack llama_tensor_get_type() to use our custom quant.
safe_sed_function "src/llama-quant.cpp" \
"/^static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor \\* tensor, llama_ftype ftype) {" \
"static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {\n\
const std::string name = ggml_get_name(tensor);\n\
if (name.find(\"attn_kv_b\") != std::string::npos || name.find(\"attn_k_b_trans\") != std::string::npos) {\n\
return GGML_TYPE_BF16;\n\
}\n\
return GGML_TYPE_Q8_0;\n\
}"
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
cmake --build build --config Release -- -j 44

You may not want to patch … I'm just checking now what difference (if any) this runtime slicing of wkv_b makes:

// {n_embd_head_v, n_head, n_tokens}
ggml_tensor * wv_b = ggml_view_3d(ctx0, model.layers[il].wkv_b,
kv_lora_rank, n_embd_head_v, n_head,
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * (n_embd_head_qk_nope + n_embd_head_v),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * n_embd_head_qk_nope);
So it turns out you can't do this slice anyway:

// {n_embd_head_v, n_head, n_tokens}
ggml_tensor * wv_b = ggml_view_3d(ctx0, model.layers[il].wkv_b,
kv_lora_rank, n_embd_head_v, n_head,
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * (n_embd_head_qk_nope + n_embd_head_v),
ggml_row_size(model.layers[il].wkv_b->type, kv_lora_rank) * n_embd_head_qk_nope);
and if you try to …

Sorry guys, but I can't waste any more time on this as each of these changes is taking several hours to re-quant all the models to test with, so I'm just gonna go back to storing a copy of …
Continued in #12772
This PR adds the -mla option (long name --mla-attn) and is a continuation of @fairydreaming's #11446 PR.

The quants created for @fairydreaming's PR should all still work fine, but you won't be able to use the new option without requantising if you have a GGUF without the attn_k_b and attn_v_b tensors that this PR (or @fairydreaming's) adds.

I've set these two ops to use F32 for the MLA and non-MLA paths respectively:

ggml_tensor * q_nope_absorbed = ggml_mul_mat(ctx0, wk_b, q_nope);
ggml_mul_mat_set_prec(q_nope_absorbed, GGML_PREC_F32);

ggml_tensor * kv = ggml_mul_mat(ctx0, model.layers[il].wkv_b, kv_cmpr);
ggml_mul_mat_set_prec(kv, GGML_PREC_F32);

This has been tested to fix the weirdness I was getting for the non-MLA version, which has wkv_b stored as Q8_0; my MLA quant has wk_b stored as BF16, so I can't test that case yet.

These may cause some regression in performance as a result, but this is consistent with the way that llm_graph_context::build_attn_mha() handles this unconditionally:

- The precision of q_nope_absorbed can be thought of as belonging to the kq calculation, but transferred to this operation instead of the actual kq calculation, which now happens in the "compressed" space instead.
- The kv = ggml_mul_mat(ctx0, model.layers[il].wkv_b, kv_cmpr) operation is harder to justify in terms of kq, but it can be tested to produce gibberish without this fix by just asking "Tell me a joke about pandas" and seeing it doesn't output any newlines and eventually just stops abruptly...

I've done my best to cleanly add the new code to the newly refactored llama-graph.cpp code, but sadly it's not possible to integrate with build_attn_mha() as:

- There's no way to decompress using wv_b, as build_attn_mha() applies the kqv_merged and cont inside.
- We lose the performance gain from using 2D views when applying MQA (ie: GQA with 1 group); to fix this I would have to start moving the wo stuff out of build_attn_mha() and/or create a new build_attn_mqa() function (fixed now - see below).

This still looks a bit ugly in llama_kv_cache_unified::init(), but I can't see a better way to do it currently.

I'll leave it as a draft for now and would welcome feedback, as I can't easily test this for other back-ends or without the newly added "offload tensor" stuff.