From 9ecbb12f4761addc5de204f45bc87dda69525125 Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 12:59:33 +0700
Subject: [PATCH 1/6] [Doc] available links path

---
 docs/source/models/supported_models.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md
index acbe27a22a679..c92851b193635 100644
--- a/docs/source/models/supported_models.md
+++ b/docs/source/models/supported_models.md
@@ -843,5 +843,5 @@ We have the following levels of testing for models:
 
 1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
 2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
-3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:main/examples) for the models that have passed this test.
+3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
 4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.

From 733e2dbfc9507f7f95ed8ab51b4409a42818d477 Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 13:02:09 +0700
Subject: [PATCH 2/6] [Doc] links path tensorizer example

---
 docs/source/models/extensions/tensorizer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/models/extensions/tensorizer.md b/docs/source/models/extensions/tensorizer.md
index ae17e3437bca6..42ed5c795dd27 100644
--- a/docs/source/models/extensions/tensorizer.md
+++ b/docs/source/models/extensions/tensorizer.md
@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
 For more information on CoreWeave's Tensorizer, please refer to
 [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing
 a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
-the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference/tensorize_vllm_model.html).
+the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html).
 
 ```{note}
 Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
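For context on what the re-linked example script in the second patch covers: loading a pre-serialized model back into vLLM looks roughly like the sketch below. This is a minimal illustration, not part of the patch; the `TensorizerConfig` / `load_format="tensorizer"` interface follows the example script's usage notes, and the model name and S3 URI are placeholders.

```python
# Minimal sketch (assumptions: `pip install vllm[tensorizer]` done, and the
# model already serialized with examples/offline_inference/tensorize_vllm_model.py;
# the S3 URI below is a placeholder, not a real artifact).
from vllm import LLM
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig

config = TensorizerConfig(
    tensorizer_uri="s3://my-bucket/opt-125m/model.tensors",  # placeholder URI
)

llm = LLM(
    model="facebook/opt-125m",         # placeholder model
    load_format="tensorizer",          # deserialize weights via Tensorizer
    model_loader_extra_config=config,  # tells the loader where the tensors live
)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```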
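And for the "Strict Consistency" testing level quoted in the first patch's context, the comparison it describes amounts to something like the following sketch. The real tests under tests/models are more rigorous (token-level and logprob comparisons); exact string equality here is a simplification, and the model name is a placeholder.

```python
# Sketch of the "Strict Consistency" idea: greedy HuggingFace output vs
# greedy vLLM output. Simplified relative to vLLM's real model tests.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "facebook/opt-125m"  # placeholder
prompt = "The capital of France is"

# Reference: HuggingFace Transformers under greedy decoding.
tok = AutoTokenizer.from_pretrained(model_id)
hf = AutoModelForCausalLM.from_pretrained(model_id)
hf_ids = hf.generate(**tok(prompt, return_tensors="pt"),
                     max_new_tokens=16, do_sample=False)
hf_text = tok.decode(hf_ids[0], skip_special_tokens=True)

# Candidate: vLLM; temperature=0 is greedy, and output excludes the prompt.
vllm_out = LLM(model=model_id).generate(
    [prompt], SamplingParams(temperature=0.0, max_tokens=16))
vllm_text = prompt + vllm_out[0].outputs[0].text

assert hf_text == vllm_text, (hf_text, vllm_text)
```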
From 8e17317d6d925adc8790030a9de27f8917f3d3b2 Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 13:07:33 +0700
Subject: [PATCH 3/6] [Doc] branch links

---
 docs/source/getting_started/installation/gpu-rocm.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu-rocm.md
index e36b92513e31d..8f0090e465567 100644
--- a/docs/source/getting_started/installation/gpu-rocm.md
+++ b/docs/source/getting_started/installation/gpu-rocm.md
@@ -106,9 +106,9 @@ $ cd ../..
 - If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
 ```
 
-2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck_tile)
+2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck-tile-seqlen1)
 
-Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck_tile#amd-gpurocm-support)
+Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck-tile-seqlen1#amd-gpurocm-support)
 
 Alternatively, wheels intended for vLLM use can be accessed under the releases.
 For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.

From 9fcd4b178653d412c51034e38f3fa8c8a04f563c Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 14:58:52 +0700
Subject: [PATCH 4/6] [Doc] example script

---
 docs/source/models/extensions/tensorizer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/models/extensions/tensorizer.md b/docs/source/models/extensions/tensorizer.md
index 42ed5c795dd27..1991694a8d66d 100644
--- a/docs/source/models/extensions/tensorizer.md
+++ b/docs/source/models/extensions/tensorizer.md
@@ -9,7 +9,7 @@ shorter Pod startup times and CPU memory usage. Tensor encryption is also suppor
 For more information on CoreWeave's Tensorizer, please refer to
 [CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing
 a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
-the [vLLM example script](https://docs.vllm.ai/en/stable/getting_started/examples/tensorize_vllm_model.html).
+the [vLLM example script](gh-file:examples/offline_inference/tensorize_vllm_model.py).
 
 ```{note}
 Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
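As a usage note for the CK flash-attention path this ROCm patch re-links: the surrounding install guide documents a `VLLM_USE_TRITON_FLASH_ATTN` environment variable for switching from the default Triton backend to the CK build. A minimal sketch assuming that flag, with a placeholder model:

```python
# Sketch: opt out of the default Triton flash attention on ROCm so vLLM
# uses the CK flash-attention build installed in the step above. The env
# var must be set before vLLM initializes; the model name is a placeholder.
import os

os.environ["VLLM_USE_TRITON_FLASH_ATTN"] = "0"  # 0 => use CK flash attention

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder
out = llm.generate(["San Francisco is"],
                   SamplingParams(temperature=0.0, max_tokens=16))
print(out[0].outputs[0].text)
```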
From f0ecc168f3070be26cbb2b295292b9b0a2ac1c5c Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 15:48:01 +0700
Subject: [PATCH 5/6] [Doc] links followed FA_BRANCH

---
 docs/source/getting_started/installation/gpu-rocm.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu-rocm.md
index 8f0090e465567..323b9601cd0a1 100644
--- a/docs/source/getting_started/installation/gpu-rocm.md
+++ b/docs/source/getting_started/installation/gpu-rocm.md
@@ -106,9 +106,9 @@ $ cd ../..
 - If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
 ```
 
-2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/ck-tile-seqlen1)
+2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/b7d29fb)
 
-Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/ck-tile-seqlen1#amd-gpurocm-support)
+Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/b7d29fb#amd-rocm-support)
 
 Alternatively, wheels intended for vLLM use can be accessed under the releases.
 For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.

From 606178b7f8f5e45c27748c8b380112a26bffd16c Mon Sep 17 00:00:00 2001
From: Guspan Tanadi <36249910+guspan-tanadi@users.noreply.github.com>
Date: Fri, 10 Jan 2025 17:15:23 +0700
Subject: [PATCH 6/6] [Doc] recent version number

---
 docs/source/getting_started/installation/gpu-rocm.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/getting_started/installation/gpu-rocm.md b/docs/source/getting_started/installation/gpu-rocm.md
index 323b9601cd0a1..f0dab099e029b 100644
--- a/docs/source/getting_started/installation/gpu-rocm.md
+++ b/docs/source/getting_started/installation/gpu-rocm.md
@@ -108,7 +108,7 @@ $ cd ../..
 
 2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention/tree/b7d29fb)
 
-Install ROCm's flash attention (v2.5.9.post1) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/b7d29fb#amd-rocm-support)
+Install ROCm's flash attention (v2.7.0-cktile) following the instructions from [ROCm/flash-attention](https://github.com/ROCm/flash-attention/tree/b7d29fb#amd-rocm-support)
 
 Alternatively, wheels intended for vLLM use can be accessed under the releases.
 For example, for ROCm 6.2, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo |grep gfx`.
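Finally, a quick way to confirm the version number the last patch cites against what is actually installed is to read the flash-attn package metadata. A small stdlib-only sketch; the exact version string depends on what the pinned branch's build embeds.

```python
# Sanity check: does the installed flash-attn build report the version the
# doc now cites (v2.7.0-cktile)? The exact string depends on the branch.
import importlib.metadata

try:
    print("flash-attn version:", importlib.metadata.version("flash_attn"))
except importlib.metadata.PackageNotFoundError:
    print("flash-attn not installed; vLLM will fall back to another backend.")
```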