
Commit 4cebe66

0.2.0 Release
- You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
- Over 3x lighter Docker image size.
- OpenAI Chat Completion output format (optional to use).
- Extremely fast image build time.
- Docker Secrets-protected Hugging Face token support for building the image with a model baked in without exposing your token.
- Support for `n` and `best_of` sampling parameters, which allow you to generate multiple responses from a single prompt.
- New environment variables for various configuration options.
- vLLM Version: 0.2.7
1 parent 368c5f8 commit 4cebe66

File tree

10 files changed: +365 −139 lines changed


Dockerfile

Lines changed: 18 additions & 30 deletions
@@ -1,52 +1,40 @@
-# syntax = docker/dockerfile:1.3
-ARG WORKER_CUDA_VERSION=11.8
-FROM runpod/base:0.4.4-cuda${WORKER_CUDA_VERSION}.0 as builder
-
-ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope
-
-# Set Environment Variables
-ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
-    HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
-    HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
-    TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
-    HF_TRANSFER=1
+ARG WORKER_CUDA_VERSION=11.8.0
+FROM runpod/worker-vllm:base-0.2.0-cuda${WORKER_CUDA_VERSION} AS vllm-base
 
+RUN apt-get update -y \
+    && apt-get install -y python3-pip
 
 # Install Python dependencies
 COPY builder/requirements.txt /requirements.txt
 RUN --mount=type=cache,target=/root/.cache/pip \
-    python3.11 -m pip install --upgrade pip && \
-    python3.11 -m pip install --upgrade -r /requirements.txt && \
-    rm /requirements.txt
-
-# Install torch and vllm based on CUDA version
-RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
-    python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
-    python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
-    else \
-    python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
-    fi && \
-    rm -rf /root/.cache/pip
+    python3 -m pip install --upgrade pip && \
+    python3 -m pip install --upgrade -r /requirements.txt
 
 # Add source files
-COPY src .
+COPY src /src
 
 # Setup for Option 2: Building the Image with the Model included
 ARG MODEL_NAME=""
-ARG MODEL_BASE_PATH="/runpod-volume/"
+ARG MODEL_BASE_PATH="/runpod-volume"
 ARG QUANTIZATION=""
 
 ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
     MODEL_NAME=$MODEL_NAME \
-    QUANTIZATION=$QUANTIZATION
-
+    QUANTIZATION=$QUANTIZATION \
+    HF_DATASETS_CACHE="${MODEL_BASE_PATH}/huggingface-cache/datasets" \
+    HUGGINGFACE_HUB_CACHE="${MODEL_BASE_PATH}/huggingface-cache/hub" \
+    HF_HOME="${MODEL_BASE_PATH}/huggingface-cache/hub" \
+    HF_TRANSFER=1
+
 RUN --mount=type=secret,id=HF_TOKEN,required=false \
     if [ -f /run/secrets/HF_TOKEN ]; then \
    export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
    fi && \
    if [ -n "$MODEL_NAME" ]; then \
-    python3.11 /download_model.py --model $MODEL_NAME; \
+    python3 /src/download_model.py --model $MODEL_NAME; \
    fi
 
+ENV PYTHONPATH="/:/vllm-installation"
+
 # Start the handler
-CMD ["python3.11", "/handler.py"]
+CMD ["python3", "/src/handler.py"]

README.md

Lines changed: 48 additions & 25 deletions
@@ -1,15 +1,25 @@
 <div align="center">
 
-<h1>vLLM 0.2.6 Endpoint | Serverless Worker </h1>
+<h1> vLLM Serverless Endpoint Worker </h1>
 
 [![CD | Docker-Build-Release](https://github.com/runpod-workers/worker-vllm/actions/workflows/docker-build-release.yml/badge.svg)](https://github.com/runpod-workers/worker-vllm/actions/workflows/docker-build-release.yml)
 
-🚀 | This serverless worker utilizes vLLM behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling using the built-in RunPod autoscaling feature.
+Deploy blazing-fast LLMs powered by [vLLM](https://github.com/vllm-project/vllm) on RunPod Serverless in a few clicks.
 </div>
 
+### Worker vLLM 0.2.0 - What's New
+- You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
+- Over 3x lighter Docker image size.
+- OpenAI Chat Completion output format (optional to use).
+- Extremely fast image build time.
+- Docker Secrets-protected Hugging Face token support for building the image with a model baked in without exposing your token.
+- Support for `n` and `best_of` sampling parameters, which allow you to generate multiple responses from a single prompt.
+- New environment variables for various configuration options.
+- vLLM Version: 0.2.7
+
 ## Table of Contents
 - [Setting up the Serverless Worker](#setting-up-the-serverless-worker)
-  - [Option 1: Deploy Any Model Using Pre-Built Docker Image](#option-1-deploy-any-model-using-pre-built-docker-image)
+  - [Option 1: Deploy Any Model Using Pre-Built Docker Image [**RECOMMENDED**]](#option-1-deploy-any-model-using-pre-built-docker-image-recommended)
   - [Prerequisites](#prerequisites)
   - [Environment Variables](#environment-variables)
   - [Option 2: Build Docker Image with Model Inside](#option-2-build-docker-image-with-model-inside)
@@ -26,13 +36,13 @@
 
 ## Setting up the Serverless Worker
 
-### Option 1: Deploy Any Model Using Pre-Built Docker Image
+### Option 1: Deploy Any Model Using Pre-Built Docker Image [Recommended]
 
 We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:
 
 <div align="center">
 
-Stable Image: ```runpod/worker-vllm:0.1.0```
+Stable Image: ```runpod/worker-vllm:0.2.0```
 
 Development Image: ```runpod/worker-vllm:dev```
 
@@ -43,37 +53,52 @@ Development Image: ```runpod/worker-vllm:dev```
 
 #### Environment Variables
 
-- **Required**:
+**Required**:
 - `MODEL_NAME`: Hugging Face Model Repository (e.g., `openchat/openchat-3.5-1210`).
 
-- **Optional**:
+**Optional**:
+- Model Settings:
   - `MAX_MODEL_LENGTH`: Maximum number of tokens for the engine to be able to handle. (default: maximum supported by the model)
   - `MODEL_BASE_PATH`: Model storage directory (default: `/runpod-volume`).
+  - `LOAD_FORMAT`: Format to load the model in (default: `auto`).
   - `HF_TOKEN`: Hugging Face token for private and gated models (e.g., Llama, Falcon).
-  - `NUM_GPU_SHARD`: Number of GPUs to split the model across. (default: `1`)
   - `QUANTIZATION`: AWQ (`awq`), SqueezeLLM (`squeezellm`) or GPTQ (`gptq`) Quantization. The specified Model Repo must be of a quantized model. (default: `None`)
+  - `TRUST_REMOTE_CODE`: Trust remote code for Hugging Face models. (default: `0`)
+
+- Tensor Parallelism:
+
+  Note that the more GPUs you split a model's weights across, the slower it will be due to inter-GPU communication overhead. If you can fit the model on a single GPU, it is recommended to do so.
+  - `USE_TENSOR_PARALLEL`: Enable (`1`) or disable (`0`) Tensor Parallelism. (default: `0`)
+  - `TENSOR_PARALLEL_SIZE`: Number of GPUs to shard the model across (default: `1`).
+
+- System Settings:
+  - `GPU_MEMORY_UTILIZATION`: GPU VRAM utilization (default: `0.98`).
+  - `MAX_PARALLEL_LOADING_WORKERS`: Maximum number of parallel workers for loading models (default: `number of available CPU cores`).
+
+
+- Serverless Settings:
   - `MAX_CONCURRENCY`: Max concurrent requests. (default: `100`)
   - `DEFAULT_BATCH_SIZE`: Token streaming batch size (default: `30`). This reduces the number of HTTP calls, increasing speed 8-10x vs non-batching, matching non-streaming performance.
+  - `ALLOW_OPENAI_FORMAT`: Whether to allow users to specify `use_openai_format` to get output in OpenAI format. (default: `1`)
   - `DISABLE_LOG_STATS`: Enable (`0`) or disable (`1`) vLLM stats logging.
   - `DISABLE_LOG_REQUESTS`: Enable (`0`) or disable (`1`) request logging.
 
 ### Option 2: Build Docker Image with Model Inside
 To build an image with the model baked in, you must specify the following docker arguments when building the image.
 
 #### Prerequisites
+- RunPod Account
 - Docker
-- Linux
-- NVIDIA GPU
-> [!NOTE]
-> We will be adding support for building on any OS without a GPU.
 
 #### Arguments:
 - **Required**
   - `MODEL_NAME`
 - **Optional**
   - `MODEL_BASE_PATH`: Defaults to `/runpod-volume` for network storage. Use `/models` for local container storage.
   - `QUANTIZATION`
-  - `WORKER_CUDA_VERSION`: `11.8` or `12.1` (default: `11.8` due to a small amount of workers not having CUDA 12.1 support yet. `12.1` is recommended for optimal performance).
+  - `WORKER_CUDA_VERSION`: `11.8.0` or `12.1.0` (default: `11.8.0` due to a small number of workers not having CUDA 12.1 support yet. `12.1.0` is recommended for optimal performance).
+
+For the remaining settings, you may apply them as environment variables when running the container. Supported environment variables are listed in the [Environment Variables](#environment-variables) section.
 
 #### Example: Building an image with OpenChat-3.5
 ```bash
@@ -88,22 +113,27 @@ export DOCKER_BUILDKIT=1
 ```
 2. Export your Hugging Face token as an environment variable
 ```bash
-export HF_TOKEN="your_secret_value_here"
+export HF_TOKEN="your_token_here"
 ```
 3. Add the token as a secret when building
 ```bash
 docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="openchat/openchat_3.5" .
 ```
 
-### Compatible Models
-
-- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
+### Compatible Model Architectures
 - Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
 - Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, etc.)
+- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
+- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
+- Qwen2 (`Qwen/Qwen2-7B-beta`, `Qwen/Qwen-7B-Chat-beta`, etc.)
+- StableLM (`stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.)
+- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
+- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
 - Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
 - Baichuan & Baichuan2 (`baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.)
 - BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
 - ChatGLM (`THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, etc.)
+- DeciLM (`Deci/DeciLM-7B`, `Deci/DeciLM-7B-instruct`, etc.)
 - Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
 - GPT-2 (`gpt2`, `gpt2-xl`, etc.)
 - GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
@@ -112,14 +142,6 @@ docker build -t username/image:tag --secret id=HF_TOKEN --build-arg MODEL_NAME="
 - InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
 - MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
 - OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
-- Phi (`microsoft/phi-1_5`, `microsoft/phi-2`, etc.)
-- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)
-- Yi (`01-ai/Yi-6B`, `01-ai/Yi-34B`, etc.)
-
-And any other models supported by vLLM 0.2.6.
-
-
-Ensure that you have Docker installed and properly set up before running the docker build commands. Once built, you can deploy this serverless worker in your desired environment with confidence that it will automatically scale based on demand. For further inquiries or assistance, feel free to contact our support team.
 
 
 ## Usage
@@ -129,6 +151,7 @@ You may either use a `prompt` or a list of `messages` as input. If you use `mess
 |-----------------------|----------------------|--------------------|--------------------------------------------------------------------------------------------------------|
 | `prompt` | str | | Prompt string to generate text based on. |
 | `messages` | list[dict[str, str]] | | List of messages, which will automatically have the model's chat template applied. Overrides `prompt`. |
+| `use_openai_format` | bool | False | Whether to return output in OpenAI format. `ALLOW_OPENAI_FORMAT` environment variable must be `1`, the input must be a `messages` list, and `stream` enabled. |
 | `apply_chat_template` | bool | False | Whether to apply the model's chat template to the `prompt`. |
 | `sampling_params` | dict | {} | Sampling parameters to control the generation, like temperature, top_p, etc. |
 | `stream` | bool | False | Whether to enable streaming of output. If True, responses are streamed as they are generated. |
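As a usage sketch for the new `n` and `best_of` parameters in the table above, the request below assumes the standard RunPod serverless request shape (a JSON body with an `input` object posted to the endpoint's `/runsync` route); the endpoint ID, API key, prompt, and parameter values are placeholders.

```bash
# Placeholders: set ENDPOINT_ID and RUNPOD_API_KEY for your own deployment.
curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/runsync" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "Write a one-sentence summary of vLLM.",
      "sampling_params": {
        "temperature": 0.8,
        "max_tokens": 64,
        "n": 3,
        "best_of": 5
      }
    }
  }'
```

With these values, vLLM samples five candidate completions and returns the three highest-scoring ones, which is the behavior the `n`/`best_of` release note refers to.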

builder/requirements.txt

Lines changed: 4 additions & 1 deletion
@@ -1,6 +1,9 @@
 hf_transfer
+ray
+pandas
+pyarrow
 runpod==1.5.2
 huggingface-hub
 packaging
 typing-extensions==4.7.1
-pydantic
+pydantic

src/constants.py

Lines changed: 1 addition & 0 deletions
@@ -24,4 +24,5 @@
     "prompt_logprobs": int,
     "skip_special_tokens": bool,
     "spaces_between_special_tokens": bool,
+    "include_stop_str_in_output": bool
 }
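The `include_stop_str_in_output` entry added to this sampling-parameter type map corresponds to vLLM's sampling option of the same name: when `stop` strings are supplied, setting it to `true` keeps the matched stop string in the returned text instead of trimming it. A hedged request fragment, using the same placeholder endpoint and key as the sketch above:

```bash
curl -s "https://api.runpod.ai/v2/${ENDPOINT_ID}/runsync" \
  -H "Authorization: Bearer ${RUNPOD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "prompt": "List two GPU vendors, then write DONE.",
      "sampling_params": {
        "max_tokens": 32,
        "stop": ["DONE"],
        "include_stop_str_in_output": true
      }
    }
  }'
```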
