Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Add model parameter for FaqGenGateway in gateway.py file
  Signed-off-by: sgurunat <[email protected]>
* Add langchain vllm support for FaqGen along with authentication support for vllm endpoints
  Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Updated docker_compose_llm.yaml and README file with vLLM information
  Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Updated faq-vllm Dockerfile into llm-compose-cd.yaml under github workflows
  Signed-off-by: sgurunat <[email protected]>
* Updated llm-compose.yaml file to include vllm faqgen build
  Signed-off-by: sgurunat <[email protected]>

---------

Signed-off-by: sgurunat <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent baafa40, commit f5c60f1. Showing 10 changed files with 281 additions and 0 deletions.
comps/llms/faq-generation/vllm/langchain/Dockerfile

@@ -0,0 +1,25 @@

```dockerfile
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/comps/llms/faq-generation/vllm/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/faq-generation/vllm/langchain

ENTRYPOINT ["bash", "entrypoint.sh"]
```
@@ -0,0 +1,77 @@

# vLLM FAQGen LLM Microservice

This microservice interacts with the vLLM server to generate FAQs from input text. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM also supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products).

## 🚀1. Start Microservice with Docker

If you start the LLM microservice with Docker Compose, `docker_compose_llm.yaml` also starts the vLLM service automatically.

To set up or build the vLLM image, follow the instructions in [vLLM Gaudi](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm/langchain#22-vllm-on-gaudi).

### 1.1 Set Up Environment Variables

Before starting the vLLM and LLM services, set the following environment variables:

```bash
export HF_TOKEN=${your_hf_api_token}
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```

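Both startup options read these variables from the shell, and Docker Compose substitutes an empty string (with a warning) for any variable that is not set, so a quick preflight check can save a failed start. The helper below is a hypothetical sketch, not part of this commit; `missing_vars` is an illustrative name, and the variable list is taken from the exports above.

```python
import os
from typing import Mapping

# Variables used by docker_compose_llm.yaml and the docker run commands in this README.
REQUIRED_VARS = ("HF_TOKEN", "vLLM_ENDPOINT", "LLM_MODEL_ID")


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_vars()` with no argument checks the current process environment; an empty list means every variable is set to a non-blank value.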
### 1.2 Build Docker Image

```bash
# Go to the repository root
cd ../../../../../
docker build -t opea/llm-faqgen-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/faq-generation/vllm/langchain/Dockerfile .
```

To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

Choose whichever suits your setup.

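### 1.3 Run Docker with CLI (Option A)

The vLLM flags below mirror the `vllm-service` definition in `docker_compose_llm.yaml`; in particular, the server must listen on port 80 inside the container for the `8008:80` mapping to work.

```bash
docker run -d -p 8008:80 -v ./data:/data --name vllm-service --shm-size 1g \
  --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e HF_TOKEN=$HF_TOKEN --cap-add=sys_nice \
  opea/vllm:hpu --enforce-eager --model ${LLM_MODEL_ID} --tensor-parallel-size 1 --host 0.0.0.0 --port 80
```

```bash
docker run -d --name="llm-faqgen-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_ENDPOINT=$vLLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN -e LLM_MODEL_ID=$LLM_MODEL_ID opea/llm-faqgen-vllm:latest
```

### 1.4 Run Docker with Docker Compose (Option B)

```bash
docker compose -f docker_compose_llm.yaml up -d
```
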
## 🚀2. Consume LLM Service

### 2.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```

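For scripted checks, such as waiting for the container to come up, the same request can be sent from Python with only the standard library. This is a minimal sketch, not part of the commit; `health_check_request` and `is_healthy` are hypothetical helper names, and port 9000 is the one mapped above.

```python
import urllib.request


def health_check_request(host: str, port: int = 9000) -> urllib.request.Request:
    """Build the same GET request as the curl health check above."""
    return urllib.request.Request(
        f"http://{host}:{port}/v1/health_check",
        method="GET",
        headers={"Content-Type": "application/json"},
    )


def is_healthy(host: str, port: int = 9000, timeout: float = 5.0) -> bool:
    """Return True if the service answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(health_check_request(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, so refused connections,
        # timeouts, and non-2xx responses all land here.
        return False
```

Polling `is_healthy(your_ip)` in a loop with a short sleep is one simple way to block until the service is ready.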
### 2.2 Consume FAQGen LLM Service

```bash
# Streaming response (the default: "streaming" is true unless set otherwise)
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}' \
  -H 'Content-Type: application/json'

# Non-streaming response (set "streaming" to false)
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "streaming":false}' \
  -H 'Content-Type: application/json'
```
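A client consuming the streaming response has to undo the server's framing. Judging from `stream_generator` in `llm.py`, each event is a `data:` line carrying `{"ops": [...]}` (the patch operations produced by LangChain's `astream_log`), and the stream ends with `data: [DONE]`. The sketch below is hypothetical client code, not part of this commit; the `/streamed_output_str/-` path suffix used to pick out LLM tokens is an assumption about LangChain's run-log format and may differ between versions.

```python
import json
from typing import Iterable, Iterator


def iter_patch_ops(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the patch ops carried by 'data: {"ops": [...]}' lines until 'data: [DONE]'."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield from json.loads(payload)["ops"]


def collect_tokens(lines: Iterable[str], suffix: str = "/streamed_output_str/-") -> str:
    """Concatenate streamed LLM tokens (ASSUMPTION: token ops are 'add' ops on this path suffix)."""
    return "".join(
        op["value"]
        for op in iter_patch_ops(lines)
        if op.get("op") == "add" and op.get("path", "").endswith(suffix)
    )
```

Against the live service, the input lines could come from any HTTP client that streams the body, for example `requests.post(url, json=payload, stream=True).iter_lines(decode_unicode=True)`.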
@@ -0,0 +1,2 @@

```
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
```
comps/llms/faq-generation/vllm/langchain/docker_compose_llm.yaml: 46 changes (46 additions, 0 deletions)
@@ -0,0 +1,46 @@

```yaml
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  vllm-service:
    image: opea/vllm:hpu
    container_name: vllm-gaudi-server
    ports:
      - "8008:80"
    volumes:
      - "./data:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --enforce-eager --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80
  llm:
    image: opea/llm-faqgen-vllm:latest
    container_name: llm-faqgen-server
    depends_on:
      - vllm-service
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      vLLM_ENDPOINT: ${vLLM_ENDPOINT}
      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    restart: unless-stopped

networks:
  default:
    driver: bridge
```
comps/llms/faq-generation/vllm/langchain/entrypoint.sh

@@ -0,0 +1,8 @@

```bash
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
```
comps/llms/faq-generation/vllm/langchain/llm.py

@@ -0,0 +1,102 @@

```python
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice
from comps.cores.mega.utils import get_access_token

logger = CustomLogger("llm_faqgen")
logflag = os.getenv("LOGFLAG", False)

# Environment variables
TOKEN_URL = os.getenv("TOKEN_URL")
CLIENTID = os.getenv("CLIENTID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
```

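These three variables drive the new endpoint authentication: in `llm_generate`, a token is fetched only when `TOKEN_URL`, `CLIENTID`, and `CLIENT_SECRET` are all set, and it is forwarded to vLLM as a `Bearer` header through `default_headers`. The sketch below isolates that decision so it can be tested without a token server. `build_auth_headers` and the injected `fetch_token` callable are hypothetical stand-ins for `get_access_token`, whose internals are not shown in this diff.

```python
from typing import Callable, Optional


def build_auth_headers(
    token_url: Optional[str],
    client_id: Optional[str],
    client_secret: Optional[str],
    fetch_token: Callable[[str, str, str], Optional[str]],
) -> dict:
    """Mirror llm.py: request a token only when all three settings are present."""
    if not (token_url and client_id and client_secret):
        return {}  # unauthenticated vLLM endpoint: send no extra headers
    token = fetch_token(token_url, client_id, client_secret)
    # An empty or missing token also results in no Authorization header.
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Injecting the token fetcher keeps the branching logic testable with a stub, while the real service passes its own token function.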
```python
def post_process_text(text: str):
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"
```

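`post_process_text` encodes a text fragment as a server-sent-event line: spaces become `@#$`, a lone newline becomes `<br/>`, and other whitespace-only fragments are dropped. It is defined but not called by `llm_generate` in this diff. For reference, a hypothetical client-side inverse (not part of this commit) could look like:

```python
from typing import Optional


def decode_event(event: Optional[str]) -> Optional[str]:
    """Invert post_process_text: turn a 'data: ...' event back into the text fragment."""
    if event is None:
        return None  # post_process_text returns None for whitespace-only fragments
    payload = event.removeprefix("data: ").removesuffix("\n\n")
    if payload == "<br/>":
        return "\n"
    return payload.replace("@#$", " ")
```

The encoding is lossy: if the original text itself contains `@#$`, the decoder cannot tell it apart from an encoded space.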
```python
@register_microservice(
    name="opea_service@llm_faqgen",
    service_type=ServiceType.LLM,
    endpoint="/v1/faqgen",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
    access_token = (
        get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
    )
    headers = {}
    if access_token:
        headers = {"Authorization": f"Bearer {access_token}"}

    model = input.model if input.model else os.getenv("LLM_MODEL_ID")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        model_name=model,
        default_headers=headers,
        max_tokens=input.max_tokens,
        top_p=input.top_p,
        streaming=input.streaming,
        temperature=input.temperature,
    )

    templ = """Create a concise FAQs (frequently asked questions and answers) for following text:
        TEXT: {text}
        Do not use any prefix or suffix to the FAQ.
    """
    PROMPT = PromptTemplate.from_template(templ)
    llm_chain = load_summarize_chain(llm=llm, prompt=PROMPT)
    texts = text_splitter.split_text(input.query)

    # Create multiple documents
    docs = [Document(page_content=t) for t in texts]

    if input.streaming:

        async def stream_generator():
            from langserve.serialization import WellKnownLCSerializer

            _serializer = WellKnownLCSerializer()
            async for chunk in llm_chain.astream_log(docs):
                data = _serializer.dumps({"ops": chunk.ops}).decode("utf-8")
                if logflag:
                    logger.info(data)
                yield f"data: {data}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = await llm_chain.ainvoke(docs)
        response = response["output_text"]
        if logflag:
            logger.info(response)
        return GeneratedDoc(text=response, prompt=input.query)
```

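`load_summarize_chain` is called without a `chain_type`, so LangChain's default `"stuff"` strategy applies: all `Document` chunks are concatenated and substituted into the single `{text}` slot of `PROMPT`, giving one LLM call per request. Splitting the query into `docs` therefore does not shrink the prompt sent to vLLM. The plain-Python approximation below assumes LangChain's default `"\n\n"` document separator and uses a whitespace-trimmed copy of `templ`; `stuff_prompt` is a hypothetical name, not a LangChain API.

```python
# Whitespace-trimmed copy of `templ` from llm_generate.
TEMPLATE = (
    "Create a concise FAQs (frequently asked questions and answers) for following text:\n"
    "TEXT: {text}\n"
    "Do not use any prefix or suffix to the FAQ.\n"
)


def stuff_prompt(chunks: list[str], template: str = TEMPLATE, separator: str = "\n\n") -> str:
    """Approximate the prompt a "stuff" chain builds: every chunk joined into one {text} slot."""
    return template.format(text=separator.join(chunks))
```

Only the template is parsed by `str.format`, so braces inside the input chunks are passed through unchanged.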
```python
if __name__ == "__main__":
    llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8080")
    # Split text
    text_splitter = CharacterTextSplitter()
    opea_microservices["opea_service@llm_faqgen"].start()
```
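`CharacterTextSplitter()` is created with its defaults, which in LangChain are a `"\n\n"` separator, `chunk_size=4000`, and `chunk_overlap=200` (worth confirming against the installed version). Roughly, the input is split on blank lines and the pieces are packed greedily into chunks. The hypothetical `greedy_split` below illustrates only that packing step; it deliberately ignores overlap, so it is an approximation and not LangChain's implementation.

```python
def greedy_split(text: str, separator: str = "\n\n", chunk_size: int = 4000) -> list[str]:
    """Pack separator-delimited pieces into chunks of at most chunk_size characters.

    Simplified: no chunk overlap, and a single piece longer than chunk_size is kept whole.
    """
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(separator) if p):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits in the open chunk
        else:
            if current:
                chunks.append(current)  # close the full chunk
            current = piece  # start a new chunk with this piece
    if current:
        chunks.append(current)
    return chunks
```

An input with no blank lines, such as the single-paragraph `query` in the README, stays one chunk. Note also that `llm_endpoint` and `text_splitter` are assigned only under `if __name__ == "__main__":`, so `llm_generate` relies on `llm.py` being run as a script, as `entrypoint.sh` does.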
comps/llms/faq-generation/vllm/langchain/requirements-runtime.txt: 1 change (1 addition, 0 deletions)
@@ -0,0 +1 @@

```text
langserve
```
comps/llms/faq-generation/vllm/langchain/requirements.txt

@@ -0,0 +1,15 @@

```text
docarray[full]
fastapi
huggingface_hub
langchain
langchain-huggingface
langchain-openai
langchain_community
langchainhub
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn
```