
Commit eb44a8a

add gh-pages (#95)

Authored by thesues and dongmao zhang
Co-authored-by: dongmao zhang <[email protected]>
1 parent: 4c43c6d

15 files changed: +399 −7 lines

.github/workflows/deploy-docs.yml (+37)

@@ -0,0 +1,37 @@
name: Deploy Documentation to GitHub Pages

on:
  push:
    branches:
      - main
      - setup-pages

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          pip install sphinx==6.2.1

      - name: Build documentation
        run: |
          cd docs
          make html

      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: docs/build/html
          allow_empty_commit: true
          keep_files: true

docs/Makefile (+20)

@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/make.bat (+35)

@@ -0,0 +1,35 @@
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd

docs/requirements.txt (+1)

@@ -0,0 +1 @@
sphinx==6.2.1

docs/source/api.rst (+5)

@@ -0,0 +1,5 @@
API Reference
=============

.. automodule:: infinistore
   :members:

docs/source/conf.py (+37)

@@ -0,0 +1,37 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information

import sys
import os

sys.path.insert(0, os.path.abspath("../.."))

project = "infinistore"
copyright = "2025, [email protected]"


# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
]
autodoc_mock_imports = [
    "infinistore._infinistore",
    "torch",
]

templates_path = ["_templates"]
exclude_patterns = []

# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

html_theme = "alabaster"
html_static_path = ["_static"]

docs/source/design.rst (+87)

@@ -0,0 +1,87 @@
Design and Architecture
=======================

Introduction
------------

Motivation
~~~~~~~~~~

LLM inference is moving from single-instance execution to a cluster-level disaggregated architecture. Among these efforts, prefill-decode disaggregation is probably the most prominent change: the prefill phase requires more computational power, while the decode phase places a greater demand on memory. Given this observation, disaggregating the prefill and decode phases is an important way to improve inference engine performance.
In addition to prefill-decode disaggregation, a distributed KV cache can also increase the prefix KV cache hit rate, leading to higher GPU resource utilization.
There are various related papers in this field, and some of the systems are already in production:

- Mooncake: Kimi's production serving platform. A global KV store made up of distributed DDR and SSD on each GPU host.
- Splitwise: a prefill-decode disaggregation system, which requires KV cache transfer between different machines.
- AttentionStore: similar to Mooncake, but it considers multi-turn conversation inference, with positional encoding separated from the KV cache on a single node.
- MemServe: an elastic memory pool managing distributed memory and KV caches across serving instances.

While analyzing the works above, we identified many potential improvements and new techniques for building a high-performance, scalable cluster-level inference system, such as:

- improving the request schedulers to build a more extensible and scalable scheduler,
- integrating with specific inference engine features (like extending the existing APC feature in vLLM),
- new algorithms to better scale the memory pool and re-balance hot sequences,
- exploring new techniques such as decoupled positional encoding.

We are building a high-performance open-source implementation that incorporates all the potential innovations mentioned above, so that different customers don't have to build their own.


Features
--------

Compared to a single-instance vLLM, vLLM + InfiniStore supports the following new features:

- Prefill-decode disaggregated architecture.
- Historical KV cache in DRAM and SSD: a much larger pool than the current Automatic Prefix Caching (APC) feature in vLLM, which is limited to GPU HBM.
- Cross-host KV cache: one host can reuse the historical KV cache on another host.


Architecture
------------

.. image:: img/arch.png
   :align: center

1. Infinistore and vLLM are deployed on the same server, reusing the local CPU and memory resources.

2. Memory copies within the same machine are significantly faster than RDMA, so it is recommended to use local GPU copy when reading from and writing to the local Infinistore.

3. Infinistore uses a traditional key-value structure with variable-length keys, which makes it easy to encode information like model_id, request, and token hash in the key.
   Since RDMA memory registration is very slow, Infinistore pre-registers memory for RDMA at startup and manages it with a memory pool.
   The current memory management algorithms support bitmap or jemalloc, with bitmap being the default.
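The variable-length key scheme above can be sketched as follows. The field layout (model_id / request id / token hash) and the helper name are hypothetical illustrations of what such a key could encode, not Infinistore's actual key format:

```python
import hashlib

def make_kv_key(model_id: str, request_id: str, token_ids: list[int]) -> str:
    """Build a variable-length cache key embedding model, request, and a token hash.

    This field layout is a hypothetical example; Infinistore itself only
    requires that keys are variable-length strings.
    """
    token_hash = hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()[:16]
    return f"{model_id}/{request_id}/{token_hash}"

# Identical token sequences map to identical keys, enabling prefix-cache reuse.
key = make_kv_key("llama-3-8b", "req-42", [1, 15, 973, 8])
```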

4. Read and write process:

   a. Prefill stage:
      vLLM writes the KV cache to Infinistore layer by layer during the prefill stage. The communication method can be either local GPU copy or RDMA.
      Practical experience shows that the layer-by-layer approach parallelizes network communication and GPU computation: measurements indicate that during the prefill stage the network overhead increases by no more than 1%.
      For a demo implementation, refer to demo_prefill.py.
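The layer-by-layer overlap can be sketched as below. `compute_layer` and `write_kv` are hypothetical stand-ins for the per-layer forward pass and the Infinistore write (local GPU copy or RDMA); the real API is shown in demo_prefill.py:

```python
import queue
import threading

def prefill_with_layerwise_writes(num_layers, compute_layer, write_kv):
    """Overlap per-layer KV-cache uploads with prefill compute.

    compute_layer(i) stands in for the model forward pass of layer i;
    write_kv(i, kv) stands in for the Infinistore upload of that layer.
    Both callables are hypothetical placeholders for illustration.
    """
    pending = queue.Queue()

    def uploader():
        # Drain the queue on a helper thread so transfers run concurrently
        # with compute; None is the shutdown sentinel.
        while True:
            item = pending.get()
            if item is None:
                break
            layer_idx, kv = item
            write_kv(layer_idx, kv)

    t = threading.Thread(target=uploader)
    t.start()
    for i in range(num_layers):
        kv = pending.put((i, compute_layer(i)))  # compute layer i, hand off upload
    pending.put(None)
    t.join()

# Toy usage with in-memory stand-ins for compute and transfer:
store = {}
prefill_with_layerwise_writes(
    4,
    lambda i: f"kv{i}",
    lambda i, kv: store.__setitem__(i, kv),
)
```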
60+
61+
b. Decode Stage:
62+
In the decode stage, a separate thread in vLLM downloads the kvcache and then notifies the scheduler to start the decoding process.
63+
Unlike the current community implementation of vLLM, to ensure that network operations do not block the GPU during the decode stage, an additional thread is required to download data.
64+

Communications
--------------

Local GPU copy
~~~~~~~~~~~~~~

.. image:: img/local_gpu_cpy.png
   :align: center

RDMA write
~~~~~~~~~~

.. image:: img/rdma_write.png
   :align: center

RDMA read
~~~~~~~~~

.. image:: img/rdma_read.png
   :align: center

docs/source/img/arch.png (195 KB)

docs/source/img/local_gpu_cpy.png (93.9 KB)

docs/source/img/rdma_read.png (93.9 KB)

docs/source/img/rdma_write.png (39.8 KB)

docs/source/index.rst (+24)

@@ -0,0 +1,24 @@
.. infinistore documentation master file, created by
   sphinx-quickstart on Mon Jan 6 19:40:38 2025.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to infinistore's documentation!
=======================================

.. toctree::
   :maxdepth: 1

   design
   api


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

infinistore/__init__.py (+2)

@@ -9,11 +9,13 @@
     check_supported,
     LINK_ETHERNET,
     LINK_IB,
+    register_server,
 )

 __all__ = [
     "InfinityConnection",
     "DisableTorchCaching",
+    "register_server",
     "ClientConfig",
     "ServerConfig",
     "TYPE_RDMA",
