diff --git a/comps/llms/text-generation/tgi/llama_stack/README.md b/comps/llms/text-generation/tgi/llama_stack/README.md
index ed4879933..da073e87b 100644
--- a/comps/llms/text-generation/tgi/llama_stack/README.md
+++ b/comps/llms/text-generation/tgi/llama_stack/README.md
@@ -16,11 +16,15 @@ export LLM_MODEL_ID="meta-llama/Llama-3.1-8B-Instruct" # change to your llama mo
 export TGI_LLM_ENDPOINT="http://${your_ip}:8008"
 export LLAMA_STACK_ENDPOINT="http://${your_ip}:5000"
 ```
+
 Insert `TGI_LLM_ENDPOINT` to llama stack configuration yaml, you can use `envsubst` command, or just replace `${TGI_LLM_ENDPOINT}` with actual value manually.
+
 ```bash
 envsubst < ./dependency/llama_stack_run_template.yaml > ./dependency/llama_stack_run.yaml
 ```
+
 Make sure get a `llama_stack_run.yaml` file, in which the inference provider is pointing to the correct TGI server endpoint. E.g.
+
 ```bash
 inference:
   - provider_id: tgi0
@@ -40,9 +44,11 @@ pip install -r requirements.txt
 ```

 ### 2.2 Start TGI Service
+
 First we start a TGI endpoint for your LLM model on Gaudi.
+
 ```bash
-volume="./data"
+volume="./data"
 docker run -p 8008:80 \
   --name tgi_service \
   -v $volume:/data \
@@ -63,7 +69,9 @@ docker run -p 8008:80 \
 ```

 ### 2.3 Start Llama Stack Server
+
 Then we start the Llama Stack server based on TGI endpoint.
+
 ```bash
 docker run \
   --name llamastack-service \
@@ -74,9 +82,11 @@ docker run \
 ```

 ### 2.4 Start Microservice with Python Script
+
 ```bash
 python llm.py
 ```
+
 ## 🚀3. Start Microservice with Docker (Option 2)

 If you start an LLM microservice with docker, the `docker_compose_llm.yaml` file will automatically start TGI and Llama Stack service with docker.
@@ -119,8 +129,8 @@ curl http://${your_ip}:9000/v1/health_check\
 ### 4.2 Consume the Services
-
 Verify the TGI Service
+
 ```bash
 curl http://${your_ip}:8008/generate \
   -X POST \
@@ -129,6 +139,7 @@ curl http://${your_ip}:8008/generate \
 ```

 Verify Llama Stack Service
+
 ```bash
 curl http://${your_ip}:5000/inference/chat_completion \
   -H "Content-Type: application/json" \
@@ -156,4 +167,4 @@ curl http://${your_ip}:9000/v1/chat/completions \
   -X POST \
   -d '{"query":"What is Deep Learning?","max_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
   -H 'Content-Type: application/json'
-```
\ No newline at end of file
+```
diff --git a/comps/llms/text-generation/tgi/llama_stack/dependency/llama_stack_run_template.yaml b/comps/llms/text-generation/tgi/llama_stack/dependency/llama_stack_run_template.yaml
index 48ba65dd6..5cee81449 100644
--- a/comps/llms/text-generation/tgi/llama_stack/dependency/llama_stack_run_template.yaml
+++ b/comps/llms/text-generation/tgi/llama_stack/dependency/llama_stack_run_template.yaml
@@ -1,3 +1,6 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
 version: '2'
 built_at: '2024-10-08T17:40:45.325529'
 image_name: local
diff --git a/comps/llms/text-generation/tgi/llama_stack/llm.py b/comps/llms/text-generation/tgi/llama_stack/llm.py
index c4b0f8113..e741998f0 100644
--- a/comps/llms/text-generation/tgi/llama_stack/llm.py
+++ b/comps/llms/text-generation/tgi/llama_stack/llm.py
@@ -20,6 +20,7 @@
 logger = CustomLogger("llm_tgi_llama_stack")
 logflag = os.getenv("LOGFLAG", False)

+
 @register_microservice(
     name="opea_service@llm_tgi_llama_stack",
     service_type=ServiceType.LLM,
@@ -70,4 +71,4 @@ async def stream_generator():


 if __name__ == "__main__":
-    opea_microservices["opea_service@llm_tgi_llama_stack"].start()
\ No newline at end of file
+    opea_microservices["opea_service@llm_tgi_llama_stack"].start()
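For reviewers who want to exercise the documented flow from Python rather than curl, here is a minimal sketch, not part of the patch. It reuses the exact request body and the `:9000/v1/chat/completions` route from the README hunks above; the `your_ip` placeholder handling and the choice of `httpx` (already listed in requirements.txt) are assumptions of this sketch.

```python
# Hypothetical verification script (not part of this patch).
# POSTs the same payload the README's curl example sends to the LLM microservice.
import os

import httpx

# Matches the README's ${your_ip} placeholder; defaults to localhost for a local run.
your_ip = os.getenv("your_ip", "localhost")

payload = {
    "query": "What is Deep Learning?",
    "max_tokens": 17,
    "top_k": 10,
    "top_p": 0.95,
    "typical_p": 0.95,
    "temperature": 0.01,
    "repetition_penalty": 1.03,
    "streaming": False,  # the README example uses true; False keeps the reply in one JSON body
}

with httpx.Client(timeout=60.0) as client:
    resp = client.post(f"http://{your_ip}:9000/v1/chat/completions", json=payload)
    resp.raise_for_status()
    print(resp.text)
```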
diff --git a/comps/llms/text-generation/tgi/llama_stack/requirements.txt b/comps/llms/text-generation/tgi/llama_stack/requirements.txt
index 289f741f9..1c6f37d9f 100644
--- a/comps/llms/text-generation/tgi/llama_stack/requirements.txt
+++ b/comps/llms/text-generation/tgi/llama_stack/requirements.txt
@@ -3,6 +3,8 @@ docarray[full]
 fastapi
 httpx
 huggingface_hub
+llama-stack
+llama-stack-client
 opentelemetry-api
 opentelemetry-exporter-otlp
 opentelemetry-sdk
@@ -10,5 +12,3 @@ prometheus-fastapi-instrumentator
 shortuuid
 transformers
 uvicorn
-llama-stack-client
-llama-stack
\ No newline at end of file
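The `llama-stack` / `llama-stack-client` entries moved into sorted order above also ship a Python SDK. A rough, assumption-laden sketch of talking to the Llama Stack server from that client follows; `LlamaStackClient` and `inference.chat_completion` come from the 2024-era llama-stack-client package, and exact argument names (for example `model` vs. `model_id`, message objects vs. plain dicts) differ between releases, so adjust to the installed version rather than treating this as a reference.

```python
# Assumption-laden sketch (not part of this patch): call the Llama Stack server
# started in README section 2.3 through the llama-stack-client SDK listed above.
# Argument names follow the 2024-era client and may differ in newer releases.
import os

from llama_stack_client import LlamaStackClient

endpoint = os.getenv("LLAMA_STACK_ENDPOINT", "http://localhost:5000")
client = LlamaStackClient(base_url=endpoint)

# The model name must match what the TGI provider serves (LLM_MODEL_ID in the README).
response = client.inference.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
    stream=False,
)
print(response)
```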