
[RFC]: Sparse KV cache management framework #12254

@ShawnD200

Description


Motivation.

Efficient use of the KV cache not only saves scarce GPU memory but also improves inference speed. Evicting potentially many low-utilization tokens, especially in long-context settings, is the key to achieving this. To this end, this RFC proposes a flexible and performant KV cache management framework.

Proposed Change.

The design is shown in the following figure:

[Figure: overall design of the sparse KV cache management framework]

(1) CachePolicy is a general interface that manages KV cache allocations. It generalizes from the basic cache behavior, which always allocates new slots/blocks to append new tokens, to more sophisticated policies that dictate how cache space is used when new tokens arrive, such as sliding window, H2O, and FastGen.

(2) CachePolicy (adapted from the previous BlockTable) has two straightforward methods, add_tokens_prefill and add_tokens_decode, which assign slots/blocks to new tokens. The assigned slots/blocks may be new allocations, in which case they are requested from the allocator, or they may be old ones reused from the blocks already at the policy's disposal. In the latter case, it is the policy that selects victims, evicts them, and replaces them with the new tokens.
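A minimal sketch of what this interface could look like, assuming hypothetical signatures (the class and method names follow the RFC, but the slot-index return type and the base policy's internals are illustrative assumptions, not the actual vLLM code):

```python
# Hypothetical sketch of the proposed CachePolicy interface.
# Signatures and internals are assumptions for exposition only.
from abc import ABC, abstractmethod


class CachePolicy(ABC):
    """Decides which slots/blocks receive new tokens."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    @abstractmethod
    def add_tokens_prefill(self, num_tokens: int) -> list[int]:
        """Return the slot indices assigned to the prompt tokens."""

    @abstractmethod
    def add_tokens_decode(self, num_tokens: int) -> list[int]:
        """Return the slot indices assigned to newly decoded tokens."""


class CachePolicyBase(CachePolicy):
    """Default behavior: always append new tokens into fresh slots."""

    def __init__(self, block_size: int):
        super().__init__(block_size)
        self.next_slot = 0  # next free slot; a real policy asks the allocator

    def add_tokens_prefill(self, num_tokens: int) -> list[int]:
        slots = list(range(self.next_slot, self.next_slot + num_tokens))
        self.next_slot += num_tokens
        return slots

    def add_tokens_decode(self, num_tokens: int) -> list[int]:
        # The base policy makes no distinction between prefill and decode.
        return self.add_tokens_prefill(num_tokens)
```

A sparse policy would override these two methods to return previously occupied slots once its budget is exhausted, instead of growing `next_slot`.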

(3) To do so, two data structures are employed to keep track of blocks and tokens: a PhysicalBlockTable and a VirtualBlockTable; the policy maintains both to manage the cache space. The PhysicalBlockTable (previously BlockList) holds the physical block IDs, while the VirtualBlockTable maps tokens to blocks and back, via slot mappings and token mappings respectively.
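To make the two-way bookkeeping concrete, here is an illustrative sketch; the field names and the `bind` helper are assumptions for exposition, not the actual vLLM structures:

```python
# Illustrative sketch of the two bookkeeping structures described above.
# Field names and the `bind` helper are assumptions, not the real API.
from dataclasses import dataclass, field


@dataclass
class PhysicalBlockTable:
    """Holds the physical block IDs backing one sequence (formerly BlockList)."""
    block_ids: list[int] = field(default_factory=list)


@dataclass
class VirtualBlockTable:
    """Maps tokens to cache slots and back.

    slot_mapping:  token position -> slot index (token -> block direction)
    token_mapping: slot index -> token position (block -> token direction)
    """
    slot_mapping: dict[int, int] = field(default_factory=dict)
    token_mapping: dict[int, int] = field(default_factory=dict)

    def bind(self, token_pos: int, slot: int) -> None:
        # If the slot was occupied, the old token is evicted first so that
        # both mappings stay mutually consistent.
        old = self.token_mapping.get(slot)
        if old is not None:
            del self.slot_mapping[old]
        self.slot_mapping[token_pos] = slot
        self.token_mapping[slot] = token_pos
```

Keeping both directions lets a policy answer "where does token t live?" and "which token occupies slot s?" in O(1), which is what victim selection and replacement need.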

(4) The first two concrete policies in use today are implemented: the default CachePolicyBase and CachePolicySlidingWindow, which keeps only the sliding-window context. Specifically, the sliding window is reimplemented so that it takes effect in prefill and is adhered to in decode: once a full window of blocks is held, no new block is ever allocated.
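A minimal sketch of the sliding-window behavior described above, assuming a slot-granular window for simplicity (the real policy works at block granularity, and the class internals here are illustrative, not the actual implementation):

```python
# Minimal sketch of a sliding-window policy: once the window is full,
# new tokens overwrite the oldest slots instead of allocating fresh ones.
# Slot-granular and illustrative; the real policy operates on blocks.
class CachePolicySlidingWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.num_seen = 0  # total tokens processed so far

    def _assign(self, num_tokens: int) -> list[int]:
        # Wrap around within the window: slot = position mod window_size,
        # so no new slot is ever needed once the window is occupied.
        slots = [(self.num_seen + i) % self.window_size
                 for i in range(num_tokens)]
        self.num_seen += num_tokens
        return slots

    # The window takes effect in prefill and is adhered to in decode.
    add_tokens_prefill = _assign
    add_tokens_decode = _assign
```

Applying the window already in prefill (rather than only in decode) is what caps the allocation at one window of blocks from the start, even for long prompts.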

(5) Per-layer cache management capability.

(6) Exposing attention weights from the attention backend.
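To show why a policy would want the backend's attention weights, here is a hedged sketch in the spirit of H2O: accumulate per-token attention mass and pick the lowest-scoring cached token as the eviction victim. The function name and the plain-list representation are assumptions for illustration, not the backend API:

```python
# Illustrative H2O-style victim selection from attention weights.
# `pick_victim` and the list-of-lists layout are assumptions, not vLLM code.
def pick_victim(attn_weights: list[list[float]]) -> int:
    """attn_weights[q][k] = attention from query q to cached token k.

    Returns the index of the cached token with the smallest accumulated
    attention mass, i.e. the eviction candidate.
    """
    num_keys = len(attn_weights[0])
    scores = [0.0] * num_keys
    for row in attn_weights:
        for k, w in enumerate(row):
            scores[k] += w
    return min(range(num_keys), key=scores.__getitem__)
```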

This work has a draft PR #11928. It was developed on V0 and will be revamped for V1 in the following weeks.

Tasks:

  1. Unify add_tokens_prefill and add_tokens_decode (as suggested by @comaniac).
  2. Abstract the new cache policy interface and develop the supporting data structures.
  3. Wait for and coordinate with the Hybrid Memory Allocator ([RFC]: Hybrid Memory Allocator #11382).
  4. Get attention weights from the backend.

Feedback Period.

2 weeks.

There are already efforts in this area, such as #10646 and #10942. Discussion is welcome; this work is open and can be adjusted to accommodate existing work.

CC List.

@comaniac @simon-mo @WoosukKwon @youkaichao @heheda12345 @YaoJiayi @Zachary-ai-engineer

Any Other Things.

No response

