Efficient Memory Management for Large Language Model Serving with PagedAttention

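One of my favorite pieces of work from the past two years, with a strong systems flavor. Among optimizations of the QKV computation in Transformers, the KV cache is a common and very effective one. For why the KV cache works so well, see [this article](https://zhuanlan.zhihu.com/p/717581669), though it only makes sense after you have read Attention Is All You Need.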

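At its core, the KV cache trades space for time, so GPU memory becomes the limiting factor. The key observation behind PagedAttention is that, during serving, GPU memory usage looks a lot like CPU main memory: it ends up heavily fragmented. Operating systems introduced virtual memory to avoid this problem, so a process no longer needs its memory to be physically contiguous. GPU memory can take a similar approach to reduce fragmentation and thereby raise memory "utilization".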
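
To get a feel for why memory becomes the bottleneck, here is a rough back-of-the-envelope sketch. The model shape (40 layers, hidden size 5120, 2-byte fp16 elements) and the request lengths are illustrative assumptions, and the "reserve the maximum length up front" baseline is a simplification of how a contiguous allocator might behave, not a description of any particular serving system.

```python
def kv_cache_bytes_per_token(num_layers, hidden_size, bytes_per_elem=2):
    # Each layer stores one key vector and one value vector per token,
    # each hidden_size elements wide.
    return 2 * num_layers * hidden_size * bytes_per_elem


def reserved_vs_used(prompt_len, generated_len, max_len, per_token):
    # Contiguous baseline (assumed): reserve room for max_len tokens up front,
    # even though only prompt_len + generated_len slots are ever filled.
    reserved = max_len * per_token
    used = (prompt_len + generated_len) * per_token
    return reserved, used


# Hypothetical model shape and request, chosen only for illustration.
per_token = kv_cache_bytes_per_token(num_layers=40, hidden_size=5120)
reserved, used = reserved_vs_used(
    prompt_len=100, generated_len=50, max_len=2048, per_token=per_token
)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"fraction of the reservation actually used: {used / reserved:.1%}")
```

Under these assumptions a short request fills only a small fraction of its reservation; the rest is memory that other requests cannot use.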

![image](https://github.com/user-attachments/assets/cc32cb77-6d20-4f27-9e3f-1c4a55414423)

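PagedAttention splits each generated sequence's KV cache into blocks, each holding a fixed number of key and value vectors. The blocks do not need to be contiguous in physical memory. During attention computation, the block table is used to look up the sequence's blocks and fetch the key and value vectors that are needed. This closely mirrors virtual memory and physical memory in an operating system: the block table plays the role of the page table, and the KV block manager is responsible for maintaining the block tables.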
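
The block-table idea above can be sketched in a few lines of Python. This is a toy model for intuition only: `BlockManager`, `BLOCK_SIZE`, and the method names are hypothetical and are not vLLM's API, and the sketch tracks only block ids, not the actual key and value tensors.

```python
BLOCK_SIZE = 4  # tokens per block; hypothetical value for illustration


class BlockManager:
    """Toy KV block manager: maps each sequence's logical blocks to physical ones."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        # Allocate a new physical block only when the last one is full.
        if n % BLOCK_SIZE == 0:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, token_idx):
        # Translate a logical token position into (physical block, offset),
        # much like a page-table lookup.
        if not 0 <= token_idx < self.lengths.get(seq_id, 0):
            raise IndexError(token_idx)
        table = self.block_tables[seq_id]
        return table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def free_sequence(self, seq_id):
        # Return all of a finished sequence's blocks to the free pool.
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


mgr = BlockManager(num_physical_blocks=8)
for _ in range(5):  # two sequences grow in lockstep, interleaving allocations
    mgr.append_token("a")
    mgr.append_token("b")

print(mgr.block_tables)      # each sequence owns non-adjacent physical blocks
print(mgr.locate("a", 4))    # fifth token of "a": second block, offset 0
```

Because blocks are handed out on demand, a sequence wastes at most one partially filled block, and its blocks can live anywhere in the pool.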

No wonder everyone wants to work on LLM systems; there really are opportunities to produce work like this.

