kv-cache : prepare K/V buffers for separation #14517
Open · +127 −36
Continues from #14363.
Currently, the K and V buffers in the unified KV cache are shared among all the participating sequences (hence the name "unified"). With the upcoming change #14363, the buffers can become separate from each other in order to increase the throughput for parallel decoding use cases. This PR is a preparation step to support that.
There should be no functional changes.
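To make the intent of the separation concrete, here is a purely illustrative sketch (hypothetical names, not the design from #14363): a unified cache keeps one K and one V buffer shared by all participating sequences, while a separated layout would give each sequence/stream its own K/V pair so parallel decoding streams do not contend for the same cells.

```cpp
#include <vector>

// one K/V stream, row-major: k is [n_cells, n_embd_k], v is [n_cells, n_embd_v]
struct kv_buffers {
    std::vector<float> k;
    std::vector<float> v;
};

// unified layout: every sequence writes into the same pair of buffers
struct kv_cache_unified_sketch {
    kv_buffers shared;
};

// separated layout (what #14363 works towards, schematically):
// one pair of buffers per sequence, so streams can grow independently
struct kv_cache_separated_sketch {
    std::vector<kv_buffers> per_seq;
};
```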
Handling of variable V heads is also done when `ggml_set_rows()` is used:

```
LLAMA_SET_ROWS=1 ./bin/llama-cli -hf mradermacher/OpenELM-3B-Instruct-GGUF:Q8_0 \
  -p "I believe the meaning of life is" -no-cnv -n 32 -t 1 -s 2 --top-k 1
```
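At its core, the set-rows path is a row scatter: the K/V rows computed for the current batch are written into the cache buffer at per-token destination indices. The snippet below is a minimal standalone emulation of that semantics in plain C++; it is only an illustration of the operation, it does not call the ggml API, and the row-major float buffer layout is an assumption.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// dst holds rows of row_size floats; src holds idxs.size() rows that are
// copied into dst at the positions given by idxs. In the cache this
// corresponds to scattering the batch's K (or V) rows into their cells.
static void set_rows_emulated(std::vector<float> & dst,
                              const std::vector<float> & src,
                              const std::vector<int64_t> & idxs,
                              size_t row_size) {
    for (size_t r = 0; r < idxs.size(); ++r) {
        const size_t dst_row = (size_t) idxs[r];
        std::memcpy(dst.data() + dst_row * row_size,
                    src.data() + r       * row_size,
                    row_size * sizeof(float));
    }
}
```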
The only new restriction is that we require the number of KV heads for all layers to be equal:
See llama.cpp/src/llama-kv-cache-unified.cpp, lines 70 to 77 at 40f8c48.
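As a rough sketch of what this restriction amounts to (hypothetical code, not the excerpt referenced above): all layers that participate in the cache must report the same number of KV heads, otherwise the common K/V row layout would not hold.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// reject models where the per-layer KV head count differs, since a single
// shared K/V row stride assumes the same number of heads in every layer
static void check_equal_n_head_kv(const std::vector<uint32_t> & n_head_kv_per_layer) {
    for (size_t il = 1; il < n_head_kv_per_layer.size(); ++il) {
        assert(n_head_kv_per_layer[il] == n_head_kv_per_layer[0] &&
               "K/V buffer separation currently requires an equal number of KV heads in all layers");
    }
}
```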
Support for a varying number of KV heads should be simple - we just need to make the correct view of `v_idxs` when FA is disabled. But leaving this for when we actually need it.