
kv-cache : prepare K/V buffers for separation #14517


Open: ggerganov wants to merge 1 commit into master from gg/kv-cache-prepare-separation

Conversation

ggerganov (Member) commented on Jul 3, 2025

from #14363

Currently, the K and V buffers in the unified KV cache are shared among all the participating sequences (hence the name "unified"). With the upcoming change #14363, the buffers can become separate from each other in order to increase the throughput for parallel decoding use cases. This PR is a preparation step to support that.

There should be no functional changes.

Handling of a variable number of V heads is also implemented for the path where ggml_set_rows() is used.

LLAMA_SET_ROWS=1 ./bin/llama-cli -hf mradermacher/OpenELM-3B-Instruct-GGUF:Q8_0 \
  -p "I believe the meaning of life is" -no-cnv -n 32 -t 1 -s 2 --top-k 1

The only new restriction is that we require the number of KV heads for all layers to be equal:

if (supports_set_rows) {
    // TODO: this requirement can be relaxed, but it would be much easier to implement
    //       when we have an actual model that needs this
    // ref: https://github.com/ggml-org/llama.cpp/pull/14517
    GGML_ASSERT(hparams.is_n_embd_v_gqa_homogeneous());
}

Support for a varying number of KV heads should be simple: we just need to create the correct view of v_idxs when FA is disabled. But I'm leaving this for when we actually need it.

ggerganov force-pushed the gg/kv-cache-prepare-separation branch from 2a738fe to 40f8c48 on July 3, 2025 at 12:57
Comment on lines 72 to 75
// TODO: this requirement can be relaxed, but it would be much easier to implement when we have an actual
// model that needs this
// ref: https://github.com/ggml-org/llama.cpp/pull/14517
GGML_ASSERT(hparams.is_n_embd_v_gqa_homogeneous());
A collaborator commented:

I think OpenELM is a model family which needs this, see #7359

ggerganov (Member Author) replied:

This is now fixed with the latest commit.

ggerganov force-pushed the gg/kv-cache-prepare-separation branch from 386425f to 886da0a on July 4, 2025 at 07:13
ExtReMLapin (Contributor) commented:

Ref: this sounds related to #10860.

ggerganov (Member Author) replied:

#14363 is more relevant. This PR is a standalone preparation step that I extracted to make the final PR easier to review.
