vLLM is a fast and easy-to-use library for LLM inference and serving. Its PagedAttention algorithm manages attention keys and values efficiently, delivering state-of-the-art serving throughput. The library is flexible and easy to use: it provides seamless integration with popular Hugging Face models, support for various decoding algorithms, an OpenAI-compatible API server, and more. With vLLM, you can deploy and scale LLM applications seamlessly, leveraging the flexibility and scalability of Kubernetes to meet the demands of modern AI workloads.
Before installing this product:

- Create a node group with GPUs. The product supports the following VM platforms with GPUs:
  - NVIDIA® H100 NVLink with Intel Sapphire Rapids
{% note warning %}
Before installing vLLM, you must install NVIDIA® GPU Operator on the cluster.
{% endnote %}
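If you are installing the NVIDIA® GPU Operator with Helm rather than from a marketplace, the commands below are a minimal sketch based on NVIDIA's publicly documented chart; they assume Helm is configured against your cluster, and chart options may vary by environment:

```bash
# Add NVIDIA's Helm repository and install the GPU Operator
# into its own namespace.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```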
To install the product:

- Configure the application.

  {% note info %}

  To enable authentication in Gradio, set both a username and a password. If they are not set, Gradio will be available without authentication.

  {% endnote %}

- Click **Install**.
- Wait for the application to change its status to **Deployed**.
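To follow the rollout, you can watch the application's pods until they all reach the `Running` state. This is a generic Kubernetes check; it assumes the application was deployed into `<namespace>`:

```bash
# Watch the application's pods; press Ctrl+C once they are all Running.
kubectl -n <namespace> get pods --watch
```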
To check that vLLM is working, test the OpenAI-compatible Chat Completions API served by vLLM:
- Set up port forwarding:

  ```bash
  kubectl -n <namespace> port-forward \
    services/<application_name>-service 8000:8000
  ```
- Send a request to the API (the example uses the default `h2oai/h2o-danube2-1.8b-chat` model; you can modify it):

  ```bash
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "h2oai/h2o-danube2-1.8b-chat",
      "messages": [
        {"role": "user", "content": "Hello"}
      ],
      "temperature": 0.7
    }'
  ```
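As an additional quick check, you can list the models the server exposes; the `/v1/models` route is part of the OpenAI-compatible API that vLLM serves:

```bash
# The response should list the served model,
# e.g. h2oai/h2o-danube2-1.8b-chat.
curl -s http://localhost:8000/v1/models
```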
If you enabled Gradio, check that it is working:

- Set up port forwarding:

  ```bash
  kubectl -n <namespace> port-forward \
    services/<application_name>-gradio-ui 7860:7860
  ```
- Go to http://localhost:7860/ in your web browser. If you set credentials when installing the product, use them to log in to the UI.
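If you prefer a command-line check, you can confirm that the UI responds over the forwarded port; this is a minimal sketch, and an HTTP status of 200 indicates the Gradio server is up:

```bash
# Print only the HTTP status code returned by the Gradio UI.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860/
```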
Typical use cases include:

- Natural language understanding (NLU) applications requiring efficient inference with large language models.
- Sentiment analysis and text classification tasks in various industries such as social media monitoring, customer feedback analysis, and content moderation.
- Language translation services requiring high-throughput and low-latency inference for real-time translation.
- Question answering systems for knowledge bases, customer support, and virtual assistants.
- Recommendation systems for personalized content delivery in e-commerce, streaming platforms, and social networks.
- Chatbots and conversational AI applications for interactive user experiences in customer service, healthcare, and education.
- Text summarization and information retrieval for content curation, search engines, and document management systems.
- Named entity recognition (NER) and entity linking for information extraction and knowledge graph construction in data analytics and research.
By using the application, you agree to the terms and conditions of the Helm chart and vLLM.