---
layout: post
title: "Llama 4 in vLLM"
author: "The vLLM Team"
image: /assets/figures/llama4/perf.png
thumbnail-img: /assets/figures/llama4/perf.png
share-img: /assets/figures/llama4/perf.png
---

We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later:

```
pip install -U vllm
```
Below, you'll find sample commands to get started. Alternatively, you can replace the CLI command with docker run ([instructions here](https://docs.vllm.ai/en/latest/deployment/docker.html)) or use our Pythonic interface, the [`LLM` class](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference), for local batch inference. We also recommend checking out the [demo from the Meta team](https://github.com/meta-llama/llama-cookbook/blob/main/getting-started/build_with_llama_4.ipynb) showcasing the 1M-token long-context capability with vLLM.
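
If you prefer the Docker route, a minimal sketch is shown below. It follows the general `vllm/vllm-openai` image usage from the Docker instructions linked above; the Hugging Face token, model, and context length are placeholders to adjust for your setup.

```
# Sketch: serve Scout through the official OpenAI-compatible Docker image.
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<your_token>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```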

## Usage Guide

Here's how you can serve the Llama 4 models on different hardware configurations.

Using 8x H100 GPUs, vLLM can serve Scout with a 1M-token context and Maverick with about 430K. See the tips below for boosting performance and leveraging long context.

On 8x H100 GPUs:

* Scout (up to 1M context):

```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000
```

* Maverick (up to ~430K context):

```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000
```

On 8x H200 GPUs:

* Scout (up to 3.6M context):

```
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 3600000
```

* Maverick (up to 1M context):

```
vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8
```

**Multimodality:**

The Llama 4 models excel at image understanding with up to 8-10 images. By default, the vLLM server accepts one image per request. Pass `--limit-mm-per-prompt image=10` to serve up to 10 images per request through the OpenAI-compatible API. We also recommend checking out our multi-image offline inference example with Llama 4 [here](https://github.com/vllm-project/vllm/blob/v0.8.3/examples/offline_inference/vision_language_multi_image.py).
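
As a concrete sketch of the serving path, the commands below combine the flag above with a two-image request in the standard OpenAI chat-completions format that the server exposes; the prompt and image URLs are placeholders.

```
# Sketch: allow up to 10 images per request when serving Scout.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --limit-mm-per-prompt image=10

# Sketch: send a two-image request to the OpenAI-compatible endpoint (placeholder URLs).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What are the differences between these two images?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image-1.jpg"}},
        {"type": "image_url", "image_url": {"url": "https://example.com/image-2.jpg"}}
      ]
    }]
  }'
```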

**Performance:**

With the configurations above, we observe the following output tokens/s for Scout-BF16 and Maverick-FP8:

![Output tokens/s for Scout-BF16 and Maverick-FP8](/assets/figures/llama4/perf.png)

While more performance enhancements are on the way, we believe the Llama 4 models' efficient architecture and relatively small size make them practical for scaled usage today.

**Tips for Performance and Long Context:**

* **Boost Performance & Context Length:** Set `--kv-cache-dtype fp8` to potentially double the usable context window and gain a performance boost. We observe little to no accuracy drop in relevant evaluations with this setting (see the combined command sketch after this list).
* **Maximize Context Window (up to 10M):** To fully utilize the maximum context windows (up to 10M for Scout), we recommend serving across multiple nodes using tensor parallelism or pipeline parallelism. Follow our distributed inference guide [here](https://docs.vllm.ai/en/latest/serving/distributed_serving.html).
* **Improve Long Context Accuracy (>32K):** We highly recommend adding `--override-generation-config='{"attn_temperature_tuning": true}'` to improve accuracy for contexts longer than 32K tokens.
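
Putting the single-node tips together, a combined command might look like the sketch below; it simply adds the two flags above to the 8x H100 Scout configuration, and `--max-model-len` should still be tuned for your hardware.

```
# Sketch: Scout on 8x H100 with an FP8 KV cache and attention temperature tuning enabled.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --kv-cache-dtype fp8 \
  --override-generation-config='{"attn_temperature_tuning": true}'
```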

**Other Hardware Support & Quantizations:**

* A100: We have verified that the bf16 versions of the models work well on A100 GPUs.
* INT4: An INT4-quantized version of the Scout model checkpoint that fits on a single H100 GPU is currently a work in progress. Stay tuned for updates.
* AMD MI300X: You can run Llama 4 on AMD MI300X GPUs by building [vLLM from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html?device=rocm) and using the same commands as above (a build sketch follows this list).
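
As a rough sketch of that source build, assuming the ROCm Dockerfile that ships in the vLLM repository (its exact name and location may differ between releases; the linked installation guide is authoritative):

```
# Sketch: build a ROCm-enabled vLLM image from source, then reuse the serve commands above.
git clone https://github.com/vllm-project/vllm.git
cd vllm
docker build -f Dockerfile.rocm -t vllm-rocm .
```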

**Inference Accuracy Validation:**
We validated inference accuracy against the official Meta report using lm-eval-harness; a sketch of a reproduction command follows the table. Here are the results for [meta-llama/Llama-4-Maverick-17B-128E-Instruct](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct):

| Setup           | MMLU Pro | ChartQA |
|-----------------|----------|---------|
| Reported        | 80.5     | 90      |
| H100 FP8        | 80.4     | 89.4    |
| AMD MI300X BF16 | 80.4     | 89.4    |
| H200 BF16       | 80.2     | 89.3    |
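
For reference, a run along these lines can be reproduced with lm-eval-harness's vLLM backend. The command below is only a sketch: the task name, context length, and other model arguments are illustrative rather than the exact settings behind the table, and ChartQA requires the harness's multimodal evaluation path.

```
# Sketch: MMLU Pro with the lm-eval-harness vLLM backend (arguments are illustrative).
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tensor_parallel_size=8,max_model_len=16384,gpu_memory_utilization=0.9 \
  --tasks mmlu_pro \
  --batch_size auto
```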

## Efficient Architecture and Cluster Scale Serving

Llama 4’s model architecture is particularly well-suited for efficient long-context inference, thanks to features like:

* **Mixture of Experts (MoE):** Scout uses 16 experts (17B activated parameters), and Maverick uses 128 experts (17B activated parameters). Only one expert is activated per token, maintaining efficiency.
* **Interleaved RoPE (iRoPE):** Llama 4 interleaves global attention (without RoPE) with chunked local attention (with RoPE) in a 1:3 ratio. The local attention layers attend to tokens in non-overlapping chunks, significantly reducing the quadratic complexity of attention as context length scales (see the rough cost comparison after this list).
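
As a back-of-the-envelope comparison (the numbers below are illustrative assumptions, not model configuration details): a global attention layer over N tokens scores on the order of N² query-key pairs, while a chunked local layer with chunk size C scores roughly N/C chunks of C² pairs each, i.e. only N·C pairs:

$$
\underbrace{\tfrac{N}{C}\cdot C^{2}}_{\text{chunked local}} = N\,C \;\ll\; N^{2}, \qquad \text{e.g. } N=10^{6},\ C=8192 \;\Rightarrow\; \tfrac{N^{2}}{N\,C}=\tfrac{N}{C}\approx 122.
$$

So the local layers in each 1:3 group scale roughly linearly with context length, and only the interleaved global layers pay the full quadratic cost.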

vLLM recently launched the [V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), delivering major performance speedups on single nodes, along with native torch.compile support. Our [Q2 roadmap](https://github.com/vllm-project/vllm/issues/15735) focuses on enhancing vLLM’s multi-node scaling capabilities, aiming for disaggregated, cluster-scale serving. We are actively adding support for efficient expert parallelism, multi-node data parallelism, and cluster-wide prefill disaggregation.

## Acknowledgement

We extend our sincere thanks to the Meta team for their implementation of the model architecture, extensive accuracy evaluation, and performance benchmarking: [Lucia (Lu) Fang](https://github.com/luccafong), [Ye (Charlotte) Qi](https://github.com/yeqcharlotte), [Lu Fang](https://github.com/houseroad), [Yang Chen](https://github.com/chenyang78), [Zijing Liu](https://github.com/liuzijing2014), [Yong Hoon Shin](https://github.com/sarckk), [Zhewen Li](https://github.com/zhewenl), [Jon Swenson](https://github.com/jmswen), [Kai Wu](https://github.com/wukaixingxp), [Xiaodong Wang](https://github.com/xw285cornell), [Shiyan Deng](https://github.com/842974287), [Wenchen Wang](https://github.com/wangwenchen0407), [Lai Wei](https://github.com/roywei), [Matthias Reso](https://github.com/mreso), [Chris Thi](https://github.com/cthi), [Keyun Tong](https://github.com/youngkent), [Jinho Hwang](https://github.com/jinhohwang-meta), [Driss Guessous](https://github.com/drisspg), [Aston Zhang](https://github.com/astonzhang).

We also thank the AMD team for their support in enabling these models on MI300X: [Hongxia Yang](https://github.com/hongxiayang) and Weijun Jiang.

The vLLM team’s performance benchmarks were run on hardware generously provided by Nebius and NVIDIA.