Nebius package for vLLM

Description

vLLM is a fast and easy-to-use library for LLM inference and serving. With its PagedAttention algorithm, which manages attention keys and values efficiently, vLLM delivers state-of-the-art high-throughput serving. The library is flexible and easy to use: it provides seamless integration with popular Hugging Face models, support for various decoding algorithms, an OpenAI-compatible API server, and more. With vLLM, you can deploy and scale LLM applications seamlessly, leveraging Kubernetes' flexibility and scalability to meet the demands of modern AI workloads.
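To illustrate the idea behind PagedAttention's memory management, here is a minimal toy sketch: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical token positions to physical blocks, so memory waste is bounded by at most one partially filled block per sequence. The block size, class names, and allocator below are hypothetical simplifications for illustration, not vLLM's actual implementation.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

class BlockAllocator:
    """Hands out physical block IDs from a fixed pool."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class BlockTable:
    """Maps a sequence's tokens to physical KV-cache blocks."""

    def __init__(self):
        self.blocks = []      # physical block IDs, in logical order
        self.num_tokens = 0

    def append_token(self, allocator):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = BlockTable()
for _ in range(40):           # generate 40 tokens for one sequence
    seq.append_token(allocator)

# 40 tokens at 16 tokens per block occupy 3 blocks; only the last
# block has unused slots, so fragmentation stays bounded.
print(len(seq.blocks))        # 3
```

Because blocks are allocated on demand rather than reserved up front for the maximum sequence length, many more sequences fit in the same GPU memory, which is what drives vLLM's throughput gains.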

Short description

A fast and easy-to-use library for LLM inference and serving.

Tutorial

Before installing this product:

  1. Create a node group with GPUs in it. The product supports the following VM platforms with GPUs:

    • NVIDIA® H100 NVLink with Intel Sapphire Rapids

      {% note warning %}

      Before installing vLLM, you must install the NVIDIA® GPU Operator on the cluster.

      {% endnote %}

To install the product:

  1. Configure the application:

    {% note info %}

    To enable authentication in Gradio, set both a username and a password. If they are not set, Gradio will be available without authentication.

    {% endnote %}

  2. Click Install.

  3. Wait for the application to change its status to Deployed.

Usage

To check that vLLM is working, test the OpenAI Completions API served by vLLM:

  1. Set up port forwarding:

    kubectl -n <namespace> port-forward \
      services/<application_name>-service 8000:8000
  2. Send a request to the API (the example uses the default h2oai/h2o-danube2-1.8b-chat model; you can modify it):

    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "h2oai/h2o-danube2-1.8b-chat",
        "messages": [
          {"role": "user", "content": "Hello"}
        ],
        "temperature": 0.7
      }'
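The server replies in the OpenAI chat completions format, where the generated text sits in `choices[0].message.content`. A minimal sketch of extracting it, using an illustrative sample response (the actual `id`, content, and other fields will differ on your deployment):

```python
import json

# Illustrative response in the OpenAI chat completions format; values
# are made up for this example, not actual server output.
sample_response = '''
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "h2oai/h2o-danube2-1.8b-chat",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you?"},
      "finish_reason": "stop"
    }
  ]
}
'''

data = json.loads(sample_response)
reply = data["choices"][0]["message"]["content"]
print(reply)  # Hello! How can I help you?
```

The same extraction works on the output of the curl command above, for example by piping it into a small script or `jq '.choices[0].message.content'`.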

If you enabled Gradio, check that it is working:

  1. Set up port forwarding:

    kubectl -n <namespace> port-forward \
      services/<application_name>-gradio-ui 7860:7860
  2. Go to http://localhost:7860/ in your web browser. If you set credentials when installing the product, use them to log in to the UI.

Use cases

  • Natural language understanding (NLU) applications requiring efficient inference with large language models.
  • Sentiment analysis and text classification tasks in various industries such as social media monitoring, customer feedback analysis, and content moderation.
  • Language translation services requiring high-throughput and low-latency inference for real-time translation.
  • Question answering systems for knowledge bases, customer support, and virtual assistants.
  • Recommendation systems for personalized content delivery in e-commerce, streaming platforms, and social networks.
  • Chatbots and conversational AI applications for interactive user experiences in customer service, healthcare, and education.
  • Text summarization and information retrieval for content curation, search engines, and document management systems.
  • Named entity recognition (NER) and entity linking for information extraction and knowledge graph construction in data analytics and research.

Links

Legal

By using the application, you agree to the terms and conditions of the Helm chart and vLLM.