This directory provides a quick overview of deploying local LLMs as an API service.
text-generation-inference is a Rust, Python and gRPC server for LLMs. It is used in production at HuggingFace to power the inference widgets on the model hub. It is the recommended option for serving LLMs at production scale, with a large number of requests and high availability.
- Fast inference speed thanks to optimized CUDA kernels (FlashAttention)
- Supports model sharding for parallel execution on multiple GPUs
- Provides streaming + blocking APIs
- Supports continuous batching for increased throughput
- Production-ready (tracing, monitoring)
- Supports 8-bit and GPTQ quantization
- Does not support LoRA loading
- GPTQ speed is not the best
- Needs to be compiled from source (Rust + CUDA kernels) if hosted natively; the provided Docker container can be used to streamline the process (see the sketch after this list).
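For reference, a Docker-based launch typically looks roughly like the following. This is a minimal sketch only, assuming the official `ghcr.io/huggingface/text-generation-inference` image; the image tag, volume path and model name are placeholders, and scripts/run_inference_server.sh in this directory is the command actually used here.

```bash
# Minimal Docker-based launch sketch (image tag, volume path and model name
# are placeholder assumptions; see scripts/run_inference_server.sh for the
# actual command used in this repo).
model=<your_model_name_or_path>
volume=$PWD/data   # cache downloaded weights between runs

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model
```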
An example script to run the inference server is provided in run_inference_server.sh. Replace your model name in `model=` and run:

```bash
bash scripts/run_inference_server.sh
```

The default port of the server is `8080`. You can then query the model using either the `/generate` or `/generate_stream` routes:
```bash
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
    -H 'Content-Type: application/json'
```
or from Python:
```bash
pip install text-generation
```
```python
from text_generation import Client

client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)

text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    if not response.token.special:
        text += response.token.text
print(text)
```
More examples can be found in the official documentation.
NOTE: Since text-generation-inference does not support LoRA at the moment, we need to export the checkpoint to merge the LoRA weights with the original base model, as mentioned in the README.
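For illustration, merging LoRA weights into the base model with PEFT typically looks like the sketch below. This assumes `peft` and `transformers` are installed; the paths are placeholders, and the export procedure described in the README is the authoritative reference.

```python
# Minimal LoRA-merge sketch (paths are placeholder assumptions; follow the
# repo README for the actual export procedure).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("<base_model_path>", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "<lora_checkpoint_path>")
merged = model.merge_and_unload()  # fold the LoRA weights into the base weights

merged.save_pretrained("<merged_model_path>")
AutoTokenizer.from_pretrained("<base_model_path>").save_pretrained("<merged_model_path>")
```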
A simple stateless API server for ExLlama, derived from the original source, is provided in fastapi_server.py.
- Very fast inference speed for GPTQ-quantized models
- Supports LoRA loading
- Provides streaming + blocking APIs
- Has a quick chat-UI demo
- Does not support continuous batching
- Not production-level
To launch the server, use this command:
```bash
python fastapi_server.py -d <path_to_model> -l <path_to_optional_lora>
```
The default endpoint is `localhost:8080`. A sample API request is provided in fastapi_request.py. A chat UI is also available when accessing the default URL in a browser (suitable for testing purposes).
Improve inference throughput with PagedAttention.
To be updated.
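This part is still a placeholder. As a rough illustration only, offline inference with a PagedAttention-based engine such as vLLM looks like the sketch below, assuming vLLM is installed (`pip install vllm`) and using a placeholder model name.

```python
# Rough sketch only (this section is still to be updated): offline inference
# with vLLM, which implements PagedAttention. The model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="<your_model_name_or_path>")
params = SamplingParams(max_tokens=17)

outputs = llm.generate(["What is Deep Learning?"], params)
print(outputs[0].outputs[0].text)
```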