
kv-cache : prepare K/V buffers for separation #14517


Open: ggerganov wants to merge 1 commit into master from gg/kv-cache-prepare-separation

Conversation

ggerganov (Member) commented on Jul 3, 2025

from #14363

Currently, the K and V buffers in the unified KV cache are shared among all the participating sequences (hence the name "unified"). With the upcoming change #14363, the buffers can become separate from each other in order to increase the throughput for parallel decoding use cases. This PR is a preparation step to support that.

There should be no functional changes.

Handling of a variable number of V heads is also implemented for the path where ggml_set_rows() is used.

LLAMA_SET_ROWS=1 ./bin/llama-cli -hf mradermacher/OpenELM-3B-Instruct-GGUF:Q8_0 \
  -p "I believe the meaning of life is" -no-cnv -n 32 -t 1 -s 2 --top-k 1

The only new restriction is that we require the number of KV heads for all layers to be equal:

if (supports_set_rows) {
    // TODO: this requirement can be relaxed, but it would be much easier to implement
    //       when we have an actual model that needs this
    // ref: https://github.com/ggml-org/llama.cpp/pull/14517
    GGML_ASSERT(hparams.is_n_embd_v_gqa_homogeneous());
}

Support for a varying number of KV heads should be simple: we just need to create the correct view of v_idxs when FA is disabled. But I'm leaving this for when we actually need it.

ggerganov force-pushed the gg/kv-cache-prepare-separation branch from 2a738fe to 40f8c48 on July 3, 2025 at 12:57
Comment on lines 72 to 75
// TODO: this requirement can be relaxed, but it would be much easier to implement when we have an actual
// model that needs this
// ref: https://github.com/ggml-org/llama.cpp/pull/14517
GGML_ASSERT(hparams.is_n_embd_v_gqa_homogeneous());
A collaborator commented:

I think OpenELM is a model family which needs this, see #7359

ggerganov (Member Author) replied:

This is now fixed with the latest commit.

ggerganov force-pushed the gg/kv-cache-prepare-separation branch from 386425f to 886da0a on July 4, 2025 at 07:13
ExtReMLapin (Contributor) commented:

Ref: this sounds related to #10860.

ggerganov (Member Author) replied:

#14363 is more relevant. This PR is a standalone preparation step that I extracted to make the final PR easier to review.
