
[RFC]: Sparse KV cache management framework #12254

@ShawnD200

Description


Motivation.

Efficient use of the KV cache not only saves scarce GPU memory but also improves inference speed. Evicting potentially many low-utilization tokens, especially in long-context settings, is the key to achieving this. To this end, this RFC proposes a flexible and performant KV cache management framework.

Proposed Change.

The design is shown in the following figure:

[Figure: overall design of the sparse KV cache management framework]

(1) CachePolicy is a general interface that manages KV cache allocations. It generalizes from the basic cache behavior, which always allocates new slots/blocks to append new tokens, to more sophisticated policies that dictate how cache space is used when new tokens arrive, such as sliding window, H2O, and FastGen.

(2) CachePolicy (adapted from the previous BlockTable) has two straightforward methods, add_tokens_prefill and add_tokens_decode, which assign slots/blocks to new tokens. The assigned slots/blocks may be new allocations, in which case they are requested from the allocator, or they may be old ones reused from the blocks already at the policy's disposal. In the latter case, it is the policy that selects victims, evicts them, and replaces them with the new tokens.
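A minimal sketch of what this interface could look like, assuming hypothetical signatures (the class and method names follow the RFC, but the slot-index return type and the base policy's internals are illustrative assumptions, not the actual vLLM code):

```python
# Hypothetical sketch of the proposed CachePolicy interface.
# Signatures and internals are assumptions for exposition only.
from abc import ABC, abstractmethod


class CachePolicy(ABC):
    """Decides which slots/blocks receive new tokens."""

    def __init__(self, block_size: int):
        self.block_size = block_size

    @abstractmethod
    def add_tokens_prefill(self, num_tokens: int) -> list[int]:
        """Return the slot indices assigned to the prompt tokens."""

    @abstractmethod
    def add_tokens_decode(self, num_tokens: int) -> list[int]:
        """Return the slot indices assigned to newly decoded tokens."""


class CachePolicyBase(CachePolicy):
    """Default behavior: always append new tokens into fresh slots."""

    def __init__(self, block_size: int):
        super().__init__(block_size)
        self.next_slot = 0  # next free slot; a real policy asks the allocator

    def add_tokens_prefill(self, num_tokens: int) -> list[int]:
        slots = list(range(self.next_slot, self.next_slot + num_tokens))
        self.next_slot += num_tokens
        return slots

    def add_tokens_decode(self, num_tokens: int) -> list[int]:
        # The base policy makes no distinction between prefill and decode.
        return self.add_tokens_prefill(num_tokens)
```

A sparse policy would override these two methods to return previously occupied slots once its budget is exhausted, instead of growing `next_slot`.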

(3) To do so, two data structures are employed to keep track of blocks and tokens: a PhysicalBlockTable and a VirtualBlockTable; the policy maintains both to manage the cache space. The PhysicalBlockTable (previously BlockList) holds the physical block IDs, while the VirtualBlockTable maps tokens to blocks and back, via slot mappings and token mappings respectively.
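To make the two-way bookkeeping concrete, here is an illustrative sketch; the field names and the `bind` helper are assumptions for exposition, not the actual vLLM structures:

```python
# Illustrative sketch of the two bookkeeping structures described above.
# Field names and the `bind` helper are assumptions, not the real API.
from dataclasses import dataclass, field


@dataclass
class PhysicalBlockTable:
    """Holds the physical block IDs backing one sequence (formerly BlockList)."""
    block_ids: list[int] = field(default_factory=list)


@dataclass
class VirtualBlockTable:
    """Maps tokens to cache slots and back.

    slot_mapping:  token position -> slot index (token -> block direction)
    token_mapping: slot index -> token position (block -> token direction)
    """
    slot_mapping: dict[int, int] = field(default_factory=dict)
    token_mapping: dict[int, int] = field(default_factory=dict)

    def bind(self, token_pos: int, slot: int) -> None:
        # If the slot was occupied, the old token is evicted first so that
        # both mappings stay mutually consistent.
        old = self.token_mapping.get(slot)
        if old is not None:
            del self.slot_mapping[old]
        self.slot_mapping[token_pos] = slot
        self.token_mapping[slot] = token_pos
```

Keeping both directions lets a policy answer "where does token t live?" and "which token occupies slot s?" in O(1), which is what victim selection and replacement need.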

(4) The first two concrete policies in use today are implemented: the default CachePolicyBase and CachePolicySlidingWindow, which keeps only the sliding-window context. Specifically, the sliding window is reimplemented so that it takes effect in prefill and is adhered to in decode: once a full window of blocks is held, no new block is ever allocated.
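A minimal sketch of the sliding-window behavior described above, assuming a slot-granular window for simplicity (the real policy works at block granularity, and the class internals here are illustrative, not the actual implementation):

```python
# Minimal sketch of a sliding-window policy: once the window is full,
# new tokens overwrite the oldest slots instead of allocating fresh ones.
# Slot-granular and illustrative; the real policy operates on blocks.
class CachePolicySlidingWindow:
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.num_seen = 0  # total tokens processed so far

    def _assign(self, num_tokens: int) -> list[int]:
        # Wrap around within the window: slot = position mod window_size,
        # so no new slot is ever needed once the window is occupied.
        slots = [(self.num_seen + i) % self.window_size
                 for i in range(num_tokens)]
        self.num_seen += num_tokens
        return slots

    # The window takes effect in prefill and is adhered to in decode.
    add_tokens_prefill = _assign
    add_tokens_decode = _assign
```

Applying the window already in prefill (rather than only in decode) is what caps the allocation at one window of blocks from the start, even for long prompts.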

(5) Per-layer cache management capability.

(6) Exposing attention weights from the attention backend.
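To show why a policy would want the backend's attention weights, here is a hedged sketch in the spirit of H2O: accumulate per-token attention mass and pick the lowest-scoring cached token as the eviction victim. The function name and the plain-list representation are assumptions for illustration, not the backend API:

```python
# Illustrative H2O-style victim selection from attention weights.
# `pick_victim` and the list-of-lists layout are assumptions, not vLLM code.
def pick_victim(attn_weights: list[list[float]]) -> int:
    """attn_weights[q][k] = attention from query q to cached token k.

    Returns the index of the cached token with the smallest accumulated
    attention mass, i.e. the eviction candidate.
    """
    num_keys = len(attn_weights[0])
    scores = [0.0] * num_keys
    for row in attn_weights:
        for k, w in enumerate(row):
            scores[k] += w
    return min(range(num_keys), key=scores.__getitem__)
```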

This work has a draft PR #11928. It was developed on V0 and will be revamped for V1 in the following weeks.

Tasks:

  1. Unify add_tokens_prefill and add_tokens_decode (as suggested by @comaniac).
  2. Abstract the new cache policy interface and develop the supporting data structures.
  3. Wait for and coordinate with the Hybrid Memory Allocator ([RFC]: Hybrid Memory Allocator #11382).
  4. Get attention weights from the backend.

Feedback Period.

2 weeks.

There are already efforts in this area, such as #10646 and #10942. Discussion is welcome; this work is open and can be adjusted to accommodate existing work.

CC List.

@comaniac @simon-mo @WoosukKwon @youkaichao @heheda12345 @YaoJiayi @Zachary-ai-engineer

Any Other Things.

No response

