Design and Architecture
=======================

Introduction
------------

Motivation
~~~~~~~~~~
LLM inference is moving from single-instance execution to a cluster-level disaggregated architecture. Among these efforts, prefill-decode disaggregation is probably the most prominent change: the prefill phase requires more computational power, while the decode phase places a greater demand on memory. Given this observation, disaggregating the prefill and decode phases is an important way to improve inference engine performance.
In addition to prefill-decode disaggregation, a distributed KV cache can also increase the prefix KV cache hit rate, leading to higher GPU resource utilization.
Several related systems have been published, and some of them are already in production:

 - Mooncake: Kimi's production serving platform. A global KV store is built from the distributed DDR and SSD of each GPU host.
 - Splitwise: A prefill-decode disaggregation system that transfers the KV cache between machines.
 - AttentionStore: Similar to Mooncake, but it targets multi-turn conversation inference and separates positional encoding from the KV cache on a single node.
 - MemServe: An elastic memory pool managing distributed memory and KV caches across serving instances.

While analyzing the works above, we identified many potential improvements and new techniques for building a high-performance, scalable cluster-level inference system, such as:

 - Improving the request scheduler to make it more extensible and scalable,
 - Integrating with specific inference engine features (for example, extending the existing APC feature in vLLM),
 - New algorithms to better scale the memory pool and re-balance hot sequences,
 - Exploring new techniques such as decoupled positional encoding.

We are trying to build a high-performance open-source implementation that incorporates all the potential innovations mentioned above, so that different customers do not have to build their own.


Features
--------

Compared to a single-instance vLLM, vLLM + InfiniStore supports the following new features:

- Prefill-decode disaggregated architecture
- Historical KV cache in DRAM and SSD: a much larger pool than the current Automatic Prefix Caching (APC) feature in vLLM, which is limited to GPU HBM.
- Cross-host KV cache: one host can reuse the historical KV cache on another host.


Architecture
------------

.. image:: img/arch.png
   :align: center

1. Infinistore and vLLM are deployed on the same server, reusing the local CPU and memory resources.

2. Memory copies within the same machine are significantly faster than RDMA, so it is recommended to use the local GPU copy path when reading from and writing to the local Infinistore.

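   As an illustration only, the snippet below sketches how a caller might pick between the two transports based on locality; the names are hypothetical and do not reflect the actual Infinistore client API.

   .. code-block:: python

      # Hypothetical sketch: choose the transport for a KV-cache access.
      # LOCAL_GPU_COPY / RDMA are illustrative constants, not real Infinistore symbols.
      LOCAL_GPU_COPY = "local_gpu_copy"
      RDMA = "rdma"

      def choose_transport(store_host: str, worker_host: str) -> str:
          # Same machine: a GPU/pinned-memory copy avoids the NIC entirely and is
          # significantly faster than an RDMA round trip.
          if store_host == worker_host:
              return LOCAL_GPU_COPY
          # Remote store: the data has to cross the network, so use RDMA.
          return RDMA

      assert choose_transport("node-1", "node-1") == LOCAL_GPU_COPY
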
3. Infinistore uses a traditional key-value structure with variable-length keys, which makes it easy to encode information such as the model_id, the request, and a token hash in the key.
   Because RDMA memory registration is very slow, Infinistore pre-registers memory for RDMA during startup and manages it with a memory pool.
   The memory pool currently supports two allocation algorithms, bitmap and jemalloc, with bitmap being the default.

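   The exact key layout is an implementation detail; the sketch below only illustrates how a variable-length key can encode the model, the request, and a hash of the token prefix (the helper and field layout are hypothetical, not the actual Infinistore format).

   .. code-block:: python

      # Hypothetical sketch of a variable-length KV-cache key.
      import hashlib

      def make_kv_key(model_id: str, request_id: str, tokens: list[int], layer: int) -> str:
          # Hash the token prefix so that identical prefixes map to the same entry.
          token_hash = hashlib.sha256(
              b"".join(t.to_bytes(4, "little") for t in tokens)
          ).hexdigest()[:16]
          # Variable-length keys let model, request and token information live
          # directly in the key instead of a fixed-size binary layout.
          return f"{model_id}/{request_id}/layer{layer}/{token_hash}"

      key = make_kv_key("llama-3-8b", "req-42", [101, 2009, 2003], layer=0)
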
4. Read and Write Process:

   a. Prefill Stage:
      vLLM writes the KV cache to Infinistore layer by layer during the prefill stage. Communication can use either local GPU copy or RDMA.
      Practical experience shows that the layer-by-layer approach overlaps network communication with GPU computation; measurements indicate that during the prefill stage the network adds no more than 1% overhead.
      For a demo implementation, refer to demo_prefill.py.

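      The stand-alone sketch below shows how a per-layer upload can overlap with the computation of the next layer; ``store.write_kv`` and the layer interface are assumed placeholders, not the actual Infinistore or vLLM APIs (see demo_prefill.py for the real integration).

      .. code-block:: python

         # Illustrative sketch only: "store" and write_kv() are placeholders.
         import torch

         def prefill_with_layerwise_offload(layers, hidden, store, request_key):
             copy_stream = torch.cuda.Stream()
             for i, layer in enumerate(layers):
                 hidden, kv = layer(hidden)  # compute layer i on the default stream
                 # Enqueue the KV upload for layer i on a side stream so the copy or
                 # RDMA transfer overlaps with the computation of layer i + 1.
                 copy_stream.wait_stream(torch.cuda.current_stream())
                 with torch.cuda.stream(copy_stream):
                     store.write_kv(f"{request_key}/layer{i}", kv)
             # All uploads must finish before the request is handed off to decode.
             copy_stream.synchronize()
             return hidden
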
   b. Decode Stage:
      In the decode stage, a separate thread in vLLM downloads the KV cache and then notifies the scheduler to start decoding.
      Unlike the current community implementation of vLLM, an additional thread is required to download the data so that network operations do not block the GPU during the decode stage.

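      A minimal sketch of this pattern, with a hypothetical ``store.read_kv`` call and ``scheduler.mark_ready`` callback standing in for the real vLLM hooks:

      .. code-block:: python

         # Illustrative sketch only: "store" and "scheduler" are placeholders.
         import threading

         def prefetch_kv_async(store, scheduler, request_id, layer_keys):
             """Download the historical KV cache off the GPU's critical path, then
             tell the scheduler that the request may start decoding."""
             def _worker():
                 kv_blocks = [store.read_kv(k) for k in layer_keys]  # network I/O only
                 scheduler.mark_ready(request_id, kv_blocks)         # unblock decoding
             thread = threading.Thread(target=_worker, daemon=True)
             thread.start()
             return thread
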
Communications
--------------

Local GPU copy
~~~~~~~~~~~~~~

.. image:: img/local_gpu_cpy.png
   :align: center


RDMA write
~~~~~~~~~~

.. image:: img/rdma_write.png
   :align: center


RDMA read
~~~~~~~~~

.. image:: img/rdma_read.png
   :align: center