🚀 | This serverless worker utilizes vLLM behind the scenes and is integrated into RunPod's serverless environment. It supports dynamic auto-scaling using the built-in RunPod autoscaling feature.

Deploy Blazing-fast LLMs powered by [vLLM](https://github.com/vllm-project/vllm) on RunPod Serverless in a few clicks.
### Worker vLLM 0.2.0 - What's New

- You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
- Over 3x lighter Docker image size.
- OpenAI Chat Completion output format (optional to use).
- Extremely fast image build time.
- Docker Secrets-protected Hugging Face token support for building the image with a model baked in without exposing your token.
- Support for the `n` and `best_of` sampling parameters, which allow you to generate multiple responses from a single prompt.
- New environment variables for various configuration options.
- vLLM Version: 0.2.7

## Table of Contents
- [Setting up the Serverless Worker](#setting-up-the-serverless-worker)
  - [Option 1: Deploy Any Model Using Pre-Built Docker Image [RECOMMENDED]](#option-1-deploy-any-model-using-pre-built-docker-image-recommended)
    - [Prerequisites](#prerequisites)
    - [Environment Variables](#environment-variables)
  - [Option 2: Build Docker Image with Model Inside](#option-2-build-docker-image-with-model-inside)
## Setting up the Serverless Worker
### Option 1: Deploy Any Model Using Pre-Built Docker Image [RECOMMENDED]
We now offer a pre-built Docker Image for the vLLM Worker that you can configure entirely with Environment Variables when creating the RunPod Serverless Endpoint:
<div align="center">

Stable Image: ```runpod/worker-vllm:0.2.0```

Development Image: ```runpod/worker-vllm:dev```

</div>
#### Environment Variables
**Required**:
- `MODEL_NAME`: Hugging Face Model Repository (e.g., `openchat/openchat-3.5-1210`).

**Optional**:
- Model Settings:
  - `MAX_MODEL_LENGTH`: Maximum number of tokens the engine can handle (default: the maximum supported by the model).
  - `MODEL_BASE_PATH`: Model storage directory (default: `/runpod-volume`).
  - `LOAD_FORMAT`: Format to load the model in (default: `auto`).
  - `HF_TOKEN`: Hugging Face token for private and gated models (e.g., Llama, Falcon).
  - `QUANTIZATION`: AWQ (`awq`), SqueezeLLM (`squeezellm`), or GPTQ (`gptq`) quantization. The specified model repository must contain a quantized model (default: `None`).
  - `TRUST_REMOTE_CODE`: Trust remote code for Hugging Face models (default: `0`).
- Tensor Parallelism:

  Note that the more GPUs you split a model's weights across, the slower inference will be due to inter-GPU communication overhead. If you can fit the model on a single GPU, it is recommended to do so.
  - `USE_TENSOR_PARALLEL`: Enable (`1`) or disable (`0`) Tensor Parallelism (default: `0`).
  - `TENSOR_PARALLEL_SIZE`: Number of GPUs to shard the model across (default: `1`).
  - `MAX_PARALLEL_LOADING_WORKERS`: Maximum number of parallel workers for loading models (default: number of available CPU cores).
- Serverless Settings:
  - `MAX_CONCURRENCY`: Maximum number of concurrent requests (default: `100`).
  - `DEFAULT_BATCH_SIZE`: Token streaming batch size (default: `30`). Batching reduces the number of HTTP calls, increasing streaming speed 8-10x compared to unbatched streaming and matching non-streaming performance.
  - `ALLOW_OPENAI_FORMAT`: Whether to allow users to specify `use_openai_format` to get output in OpenAI format (default: `1`).
  - `DISABLE_LOG_STATS`: Enable (`0`) or disable (`1`) vLLM stats logging.
  - `DISABLE_LOG_REQUESTS`: Enable (`0`) or disable (`1`) request logging.
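
For illustration, an endpoint configuration for the pre-built image might look like the following. All values are placeholders, not recommendations; only `MODEL_NAME` is required.

```bash
# Illustrative endpoint environment variables (placeholder values).
MODEL_NAME=openchat/openchat-3.5-1210
MAX_MODEL_LENGTH=8192                  # cap the context length the engine accepts
HF_TOKEN=hf_xxxxxxxxxxxxxxxx           # only needed for private or gated models
USE_TENSOR_PARALLEL=0                  # keep the model on a single GPU when it fits
MAX_CONCURRENCY=100
DEFAULT_BATCH_SIZE=30
```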
### Option 2: Build Docker Image with Model Inside
To build an image with the model baked in, you must specify the following Docker build arguments when building the image.
#### Prerequisites
- RunPod Account
- Docker
#### Arguments:
- **Required**
  - `MODEL_NAME`
- **Optional**
  - `MODEL_BASE_PATH`: Defaults to `/runpod-volume` for network storage. Use `/models` for local container storage.
  - `QUANTIZATION`
  - `WORKER_CUDA_VERSION`: `11.8.0` or `12.1.0` (default: `11.8.0`, since a small number of workers do not yet support CUDA 12.1; `12.1.0` is recommended for optimal performance).

For the remaining settings, you may apply them as environment variables when running the container, as shown below. Supported environment variables are listed in the [Environment Variables](#environment-variables) section.
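
For example, if you were to run the built image locally with Docker, those settings could be passed with `-e` flags; the image tag below is a placeholder for whatever you tagged your build as:

```bash
# Minimal sketch: override a couple of optional settings at container runtime.
# "yourname/worker-vllm-openchat:latest" is a placeholder image tag.
docker run --gpus all \
  -e MAX_CONCURRENCY=50 \
  -e DISABLE_LOG_REQUESTS=1 \
  yourname/worker-vllm-openchat:latest
```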
#### Example: Building an image with OpenChat-3.5
1. Enable Docker BuildKit:
   ```bash
   export DOCKER_BUILDKIT=1
   ```
2. Export your Hugging Face token as an environment variable
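
As a rough sketch of the remaining steps, assuming the Dockerfile reads the token from a Docker BuildKit secret named `HF_TOKEN` (the image tag and token value are placeholders), the export and build might look like this:

```bash
# Export your Hugging Face token (placeholder value).
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Build the image with the model baked in. MODEL_NAME and WORKER_CUDA_VERSION are
# the build arguments documented above; the secret id "HF_TOKEN" is an assumption
# about the Dockerfile, and sourcing a secret from an environment variable
# (env=...) requires a recent Docker/BuildKit version.
docker build -t yourname/worker-vllm-openchat:latest \
  --build-arg MODEL_NAME=openchat/openchat-3.5-1210 \
  --build-arg WORKER_CUDA_VERSION=12.1.0 \
  --secret id=HF_TOKEN,env=HF_TOKEN \
  .
```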
Ensure that you have Docker installed and properly set up before running the docker build commands. Once built, you can deploy this serverless worker in your desired environment with confidence that it will automatically scale based on demand. For further inquiries or assistance, feel free to contact our support team.
## Usage
You may either use a `prompt` or a list of `messages` as input. If you use `messages`, the model's chat template will be applied automatically.

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `prompt` | str | | Prompt string to generate text based on. |
| `messages` | list[dict[str, str]] | | List of messages, which will automatically have the model's chat template applied. Overrides `prompt`. |
| `use_openai_format` | bool | False | Whether to return output in OpenAI format. The `ALLOW_OPENAI_FORMAT` environment variable must be `1`, the input must be a `messages` list, and `stream` must be enabled. |
| `apply_chat_template` | bool | False | Whether to apply the model's chat template to the `prompt`. |
| `sampling_params` | dict | {} | Sampling parameters to control the generation, like temperature, top_p, etc. |
| `stream` | bool | False | Whether to enable streaming of output. If True, responses are streamed as they are generated. |
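
As a concrete illustration of this input schema, a request to the endpoint might look like the following, assuming the standard RunPod endpoint API; the endpoint ID, API key, prompt, and sampling values are placeholders:

```bash
# Placeholder endpoint ID and API key; the payload follows the input table above.
curl -X POST "https://api.runpod.ai/v2/<endpoint_id>/runsync" \
  -H "Authorization: Bearer <your_runpod_api_key>" \
  -H "Content-Type: application/json" \
  -d '{
        "input": {
          "prompt": "Why is RunPod a good fit for serverless LLM inference?",
          "sampling_params": {"max_tokens": 100, "temperature": 0.7, "n": 2},
          "stream": false
        }
      }'
```

Swapping `prompt` for a `messages` list and setting `use_openai_format` and `stream` to true would return output in OpenAI format instead, provided the `ALLOW_OPENAI_FORMAT` environment variable is `1`.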