vLLM is a fast and easy-to-use library for LLM inference and serving. Its PagedAttention algorithm manages attention keys and values efficiently, delivering state-of-the-art serving throughput. The library is flexible and easy to use: it provides seamless integration with popular Hugging Face models, support for various decoding algorithms, an OpenAI-compatible API server, and more. With vLLM, you can deploy and scale LLM applications seamlessly, leveraging the flexibility and scalability of Kubernetes to meet the demands of modern AI workloads.
Before installing this product:

- Create a node group with GPUs. The product supports the following VM platforms with GPUs:
  - NVIDIA® H100 NVLink with Intel Sapphire Rapids
{% note warning %}
Before installing vLLM, you must install NVIDIA® GPU Operator on the cluster.
{% endnote %}
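If you are installing the NVIDIA® GPU Operator with Helm rather than from a marketplace, the commands below are a minimal sketch based on NVIDIA's publicly documented chart; they assume Helm is configured against your cluster, and chart options may vary by environment:

```bash
# Add NVIDIA's Helm repository and install the GPU Operator
# into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```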
To install the product:

- Configure the application.

  {% note info %}

  To enable authentication in Gradio, set both a username and a password. If they are not set, Gradio will be available without authentication.

  {% endnote %}

- Click **Install**.
- Wait for the application to change its status to **Deployed**.
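To follow the rollout, you can watch the application's pods until they all reach the `Running` state. This is a generic Kubernetes check; it assumes the application was deployed into `<namespace>`:

```bash
# Watch the application's pods; press Ctrl+C once they are all Running.
kubectl -n <namespace> get pods --watch
```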
To check that vLLM is working, test the OpenAI-compatible Chat Completions API served by vLLM:
- Set up port forwarding:

  ```bash
  kubectl -n <namespace> port-forward \
    services/<application_name>-service 8000:8000
  ```
- Send a request to the API (the example uses the default `h2oai/h2o-danube2-1.8b-chat` model; you can modify it):

  ```bash
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "h2oai/h2o-danube2-1.8b-chat",
      "messages": [
        {"role": "user", "content": "Hello"}
      ],
      "temperature": 0.7
    }'
  ```
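As an additional quick check, you can list the models the server exposes; the `/v1/models` route is part of the OpenAI-compatible API that vLLM serves:

```bash
# The response should list the served model,
# e.g. h2oai/h2o-danube2-1.8b-chat.
curl -s http://localhost:8000/v1/models
```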
If you enabled Gradio, check that it is working:

- Set up port forwarding:

  ```bash
  kubectl -n <namespace> port-forward \
    services/<application_name>-gradio-ui 7860:7860
  ```
- Go to http://localhost:7860/ in your web browser. If you set credentials when installing the product, use them to log in to the UI.
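If you prefer a command-line check, you can confirm that the UI responds over the forwarded port; this is a minimal sketch, and an HTTP status of 200 indicates the Gradio server is up:

```bash
# Print only the HTTP status code returned by the Gradio UI.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860/
```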
Typical use cases include:

- Natural language understanding (NLU) applications requiring efficient inference with large language models.
- Sentiment analysis and text classification tasks in various industries such as social media monitoring, customer feedback analysis, and content moderation.
- Language translation services requiring high-throughput and low-latency inference for real-time translation.
- Question answering systems for knowledge bases, customer support, and virtual assistants.
- Recommendation systems for personalized content delivery in e-commerce, streaming platforms, and social networks.
- Chatbots and conversational AI applications for interactive user experiences in customer service, healthcare, and education.
- Text summarization and information retrieval for content curation, search engines, and document management systems.
- Named entity recognition (NER) and entity linking for information extraction and knowledge graph construction in data analytics and research.
By using the application, you agree to the terms and conditions of the Helm chart and vLLM.