Commit: Add langchain vllm support for DocSum along with authentication support for vllm endpoints
Showing 8 changed files with 302 additions and 0 deletions.
comps/llms/summarization/vllm/langchain/Dockerfile
@@ -0,0 +1,28 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

ARG ARCH="cpu"

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    if [ ${ARCH} = "cpu" ]; then pip install --no-cache-dir torch torchvision --index-url https://download.pytorch.org/whl/cpu; fi && \
    pip install --no-cache-dir -r /home/user/comps/llms/summarization/vllm/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/summarization/vllm/langchain

ENTRYPOINT ["bash", "entrypoint.sh"]
comps/llms/summarization/vllm/langchain/README.md
@@ -0,0 +1,109 @@

# Document Summary vLLM Microservice

This microservice leverages LangChain to implement summarization strategies and facilitates LLM inference using [vLLM](https://github.com/vllm-project/vllm) on Intel Xeon and Gaudi2 processors. vLLM is a fast and easy-to-use library for serving Large Language Models (LLMs), enabling high-throughput text generation for the most popular open-source models, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.
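The pattern this service implements (see `llm.py` in this commit) is: split the input document into chunks, then run a LangChain summarize chain against the OpenAI-compatible `/v1` API exposed by vLLM. A minimal, illustrative sketch of that pattern follows; the endpoint URL and model name are placeholders, not project defaults.

```python
# Illustrative sketch only: the endpoint and model name below are assumptions.
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

long_text = "Paste the document to be summarized here..."

llm = VLLMOpenAI(
    openai_api_key="EMPTY",  # vLLM does not require a real API key by default
    openai_api_base="http://localhost:8008/v1",  # assumed OpenAI-compatible vLLM endpoint
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model
    max_tokens=128,
)

# Split the document into chunks and summarize them with the default "stuff" chain.
docs = [Document(page_content=t) for t in CharacterTextSplitter().split_text(long_text)]
summary = load_summarize_chain(llm=llm).invoke(docs)["output_text"]
print(summary)
```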

## 🚀1. Start Microservice with Python 🐍 (Option 1)

To start the LLM microservice, you first need to install the required Python packages.

### 1.1 Install Requirements

```bash
pip install -r requirements.txt
```

### 1.2 Start LLM Service

```bash
export HF_TOKEN=${your_hf_api_token}
docker run -p 8008:80 -v ./data:/data --name llm-docsum-vllm --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.1.0 --model-id ${your_hf_llm_model}
```

### 1.3 Verify the vLLM Service

```bash
curl http://${your_ip}:8008/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```

### 1.4 Start LLM Service with Python Script

```bash
export vLLM_ENDPOINT="http://${your_ip}:8008"
python llm.py
```

## 🚀2. Start Microservice with Docker 🐳 (Option 2)

If you start the LLM microservice with Docker, the `docker_compose_llm.yaml` file will automatically start a vLLM service in a container as well.

### 2.1 Setup Environment Variables

To start the vLLM and LLM services, you first need to set up the following environment variables.

```bash
export HF_TOKEN=${your_hf_api_token}
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```

### 2.2 Build Docker Image

```bash
cd ../../../../../
docker build -t opea/llm-docsum-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/summarization/vllm/langchain/Dockerfile .
```

To start a docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

### 2.3 Run Docker with CLI (Option A)

```bash
docker run -d --name="llm-docsum-vllm-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_ENDPOINT=$vLLM_ENDPOINT -e HF_TOKEN=$HF_TOKEN opea/llm-docsum-vllm:latest
```

### 2.4 Run Docker with Docker Compose (Option B)

```bash
docker compose -f docker_compose_llm.yaml up -d
```

## 🚀3. Consume LLM Service

### 3.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

### 3.2 Consume LLM Service

```bash
# Enable streaming to receive a streaming response. By default, streaming is set to True.
curl http://${your_ip}:9000/v1/chat/docsum \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en"}' \
  -H 'Content-Type: application/json'

# Disable streaming to receive a non-streaming response.
curl http://${your_ip}:9000/v1/chat/docsum \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en", "streaming":false}' \
  -H 'Content-Type: application/json'

# Use Chinese mode. By default, the language is set to "en".
curl http://${your_ip}:9000/v1/chat/docsum \
  -X POST \
  -d '{"query":"2024年9月26日,北京——今日,英特尔正式发布英特尔® 至强® 6性能核处理器(代号Granite Rapids),为AI、数据分析、科学计算等计算密集型业务提供卓越性能。", "max_tokens":32, "language":"zh", "streaming":false}' \
  -H 'Content-Type: application/json'
```
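For reference, the same requests can be issued from Python with the `requests` library. This is an illustrative sketch, assuming the service is reachable at `localhost:9000`; the non-streaming response is the serialized `GeneratedDoc`, so the summary is expected in its `text` field, while the streaming response is a server-sent event stream whose `data:` lines carry serialized chunks of the LangChain run log and end with `data: [DONE]`.

```python
# Illustrative Python client for the DocSum endpoint (assumes localhost:9000).
import requests

url = "http://localhost:9000/v1/chat/docsum"
doc = (
    "Text Embeddings Inference (TEI) is a toolkit for deploying and serving "
    "open source text embeddings and sequence classification models."
)

# Non-streaming request: a single JSON object is returned; the summary is
# expected in the "text" field of the serialized GeneratedDoc.
resp = requests.post(
    url,
    json={"query": doc, "max_tokens": 32, "language": "en", "streaming": False},
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["text"])

# Streaming request: the service emits server-sent events; each "data:" line
# carries a serialized chunk of the run log, terminated by "data: [DONE]".
with requests.post(
    url,
    json={"query": doc, "max_tokens": 32, "language": "en", "streaming": True},
    stream=True,
    timeout=300,
) as stream:
    for line in stream.iter_lines(decode_unicode=True):
        if line and line != "data: [DONE]":
            print(line)
```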
comps/llms/summarization/vllm/langchain/__init__.py
@@ -0,0 +1,2 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
comps/llms/summarization/vllm/langchain/docker_compose_llm.yaml
@@ -0,0 +1,35 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  vllm_service:
    image: ghcr.io/huggingface/text-generation-inference:2.1.0
    container_name: vllm-service
    ports:
      - "8008:80"
    volumes:
      - "./data:/data"
    environment:
      HF_TOKEN: ${HF_TOKEN}
    shm_size: 1g
    command: --model-id ${LLM_MODEL_ID}
  llm:
    image: opea/llm-docsum-vllm:latest
    container_name: llm-docsum-vllm-server
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      vLLM_ENDPOINT: ${vLLM_ENDPOINT}
      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    restart: unless-stopped

networks:
  default:
    driver: bridge
comps/llms/summarization/vllm/langchain/entrypoint.sh
@@ -0,0 +1,8 @@

#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
comps/llms/summarization/vllm/langchain/llm.py
@@ -0,0 +1,104 @@

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice
from comps.cores.mega.utils import get_access_token

logger = CustomLogger("llm_docsum")
logflag = os.getenv("LOGFLAG", False)

# Environment variables
TOKEN_URL = os.getenv("TOKEN_URL")
CLIENTID = os.getenv("CLIENTID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
MODEL_ID = os.getenv("LLM_MODEL_ID", None)

templ_en = """Write a concise summary of the following:
"{text}"
CONCISE SUMMARY:"""

templ_zh = """请简要概括以下内容:
"{text}"
概况:"""


def post_process_text(text: str):
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"


@register_microservice(
    name="opea_service@llm_docsum",
    service_type=ServiceType.LLM,
    endpoint="/v1/chat/docsum",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
    if input.language in ["en", "auto"]:
        templ = templ_en
    elif input.language in ["zh"]:
        templ = templ_zh
    else:
        raise NotImplementedError('Please specify the input language in "en", "zh", "auto"')

    PROMPT = PromptTemplate.from_template(templ)

    if logflag:
        logger.info("After prompting:")
        logger.info(PROMPT)

    access_token = (
        get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
    )
    headers = {}
    if access_token:
        headers = {"Authorization": f"Bearer {access_token}"}
    llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8080")
    model = input.model if input.model else os.getenv("LLM_MODEL_ID")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        model_name=model,
        default_headers=headers,
        max_tokens=input.max_tokens,
        top_p=input.top_p,
        streaming=input.streaming,
        temperature=input.temperature,
        presence_penalty=input.repetition_penalty,
    )
    llm_chain = load_summarize_chain(llm=llm, prompt=PROMPT)
    texts = text_splitter.split_text(input.query)

    # Create multiple documents
    docs = [Document(page_content=t) for t in texts]

    if input.streaming:

        async def stream_generator():
            from langserve.serialization import WellKnownLCSerializer

            _serializer = WellKnownLCSerializer()
            async for chunk in llm_chain.astream_log(docs):
                data = _serializer.dumps({"ops": chunk.ops}).decode("utf-8")
                if logflag:
                    logger.info(data)
                yield f"data: {data}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = await llm_chain.ainvoke(docs)
        response = response["output_text"]
        if logflag:
            logger.info(response)
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    # Split text
    text_splitter = CharacterTextSplitter()
    opea_microservices["opea_service@llm_docsum"].start()
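The authentication support this commit adds is the combination of `get_access_token` and the `default_headers` argument to `VLLMOpenAI`: when `TOKEN_URL`, `CLIENTID`, and `CLIENT_SECRET` are set, every request to the vLLM endpoint carries a bearer token. As a rough sketch of what such a helper typically does (an assumption about a standard OAuth2 client-credentials exchange, not a copy of `comps.cores.mega.utils.get_access_token`):

```python
# Hedged sketch of a client-credentials token fetch; the real OPEA helper may differ.
import requests


def fetch_bearer_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Assumed OAuth2 client-credentials exchange against a token endpoint."""
    resp = requests.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


# llm.py then attaches the token to every request the LangChain client sends
# to the vLLM endpoint:
#   headers = {"Authorization": f"Bearer {access_token}"}
#   llm = VLLMOpenAI(..., default_headers=headers, ...)
```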
comps/llms/summarization/vllm/langchain/requirements-runtime.txt
@@ -0,0 +1 @@

langserve
comps/llms/summarization/vllm/langchain/requirements.txt
@@ -0,0 +1,15 @@

docarray[full]
fastapi
huggingface_hub
langchain #==0.1.12
langchain-huggingface
langchain-openai
langchain_community
langchainhub
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn