Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) #12010
Changes from 31 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| --- | ||
| title: FP8 INC | ||
| --- | ||
| [](){ #inc } | ||
|
|
||
| vLLM supports FP8 (8-bit floating point) weight and activation quantization using Intel® Neural Compressor (INC) on Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators. | ||
| Currently, quantization has been validated only on Llama models. | ||
|
|
||
| Intel Gaudi supports quantization of various modules and functions, including, but not limited to `Linear`, `KVCache`, `Matmul` and `Softmax`. For more information, please refer to: | ||
| [Supported Modules\\Supported Functions\\Custom Patched Modules](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-modules). | ||
|
|
||
| !!! note | ||
| Measurement files are required to run quantized models with vLLM on Gaudi accelerators. The FP8 model calibration procedure is described in the [vllm-hpu-extension](https://github.com/HabanaAI/vllm-hpu-extension/tree/main/calibration/README.md) package. | ||
|
|
||
| !!! note | ||
| `QUANT_CONFIG` is an environment variable that points to the measurement or quantization [JSON config file](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options). | ||
| The measurement configuration file is used during the calibration procedure to collect measurements for a given model. The quantization configuration is used during inference. | ||
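For orientation, a QUANTIZE-mode config typically differs from a MEASURE-mode config mainly in the `mode` field. The sketch below is only an approximation; treat the exact keys and values in the Habana documentation linked above as authoritative:

```json
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "dump_stats_path": "./inc_output/measure"
}
```

For calibration, a similar file with `"mode": "MEASURE"` (and no `scale_method`) is used, and `dump_stats_path` points to where the measurement statistics are written and later read back during quantization.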
|
|
||
| ## Run Online Inference Using FP8 | ||
|
|
||
| Once you've completed the model calibration process and collected the measurements, you can run FP8 inference with vLLM using the following command: | ||
|
|
||
| ```bash | ||
| export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_measure_g3.json | ||
| vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor_parallel_size 8 | ||
| ``` | ||
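Once the server is up, it exposes the standard OpenAI-compatible API and can be exercised with any OpenAI client or plain `curl`. A minimal smoke test might look like this (the port is vLLM's default, and the prompt and token budget are arbitrary illustrative choices):

```bash
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-405B-Instruct",
        "prompt": "Intel Gaudi accelerators are",
        "max_tokens": 32
    }'
```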
|
|
||
| !!! tip | ||
| If you are just prototyping or testing your model with FP8, you can use the `VLLM_SKIP_WARMUP=true` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop. | ||
|
|
||
| !!! tip | ||
| When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the following environment variables: | ||
| `VLLM_ENGINE_ITERATION_TIMEOUT_S` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes. | ||
| `VLLM_RPC_TIMEOUT` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes. | ||
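For example, to give both the engine loop and the RPC layer the 10-minute budget mentioned above, you could export:

```bash
# Engine iteration timeout, in seconds (600 s = 10 minutes)
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
# RPC timeout, in milliseconds (600000 ms = 10 minutes)
export VLLM_RPC_TIMEOUT=600000
```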
|
|
||
| ## Run Offline Inference Using FP8 | ||
|
|
||
| To run offline inference (after completing the model calibration process): | ||
|
|
||
| * Set the `QUANT_CONFIG` environment variable to point to a JSON configuration file with QUANTIZE mode. | ||
| * Pass `quantization=inc` and `kv_cache_dtype=fp8_inc` as parameters to the `LLM` object. | ||
| * Call the `shutdown` method of the `model_executor` at the end of the run. | ||
|
|
||
| ```python | ||
| from vllm import LLM | ||
| llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc") | ||
| ... | ||
| # Call llm.generate on the required prompts and sampling params. | ||
| ... | ||
| llm.llm_engine.model_executor.shutdown() | ||
| ``` | ||
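For completeness, a fuller version of the snippet above might look as follows. The prompt, sampling settings, and config path are placeholders, and it assumes the calibration measurements already exist and `QUANT_CONFIG` points to a QUANTIZE-mode config:

```python
import os

from vllm import LLM, SamplingParams

# Hypothetical path: QUANT_CONFIG must point to a QUANTIZE-mode JSON config
# (it is usually exported in the shell before launching the script).
os.environ.setdefault("QUANT_CONFIG", "/path/to/quant/config/inc/maxabs_quant_g3.json")

llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct",
          quantization="inc",
          kv_cache_dtype="fp8_inc")

prompts = ["Intel Gaudi accelerators are designed for"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

# Generate completions with the FP8-quantized model.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)

# Release the HPU resources held by the model executor.
llm.llm_engine.model_executor.shutdown()
```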
|
|
||
| ## Device for Loading the Model's Weights | ||
|
|
||
| The unquantized weights are first loaded onto the CPU, then quantized and transferred to the target device (HPU) for model execution. | ||
| This reduces the device memory footprint of model weights, as only quantized weights are stored in the device memory. |
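Conceptually, the pattern is: materialize the full-precision weight on the host, quantize it, and move only the quantized tensor to the accelerator. The sketch below is purely illustrative (it is not vLLM's or INC's actual code path) and falls back to CPU so it stays runnable without Gaudi hardware:

```python
import torch

# Full-precision weight materialized on the host, not on the accelerator.
weight_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cpu")

# Simple max-abs scaling into an FP8 representation.
scale = weight_bf16.abs().amax().float() / torch.finfo(torch.float8_e4m3fn).max
weight_fp8 = (weight_bf16.float() / scale).to(torch.float8_e4m3fn)

# Only the quantized tensor (plus its scale) is moved to the device,
# so the bf16 copy never occupies accelerator memory.
target = "hpu" if hasattr(torch, "hpu") and torch.hpu.is_available() else "cpu"
weight_fp8 = weight_fp8.to(target)
scale = scale.to(target)
```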
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -863,7 +863,7 @@ def _verify_quantization(self) -> None: | |
| optimized_quantization_methods = [ | ||
| "fp8", "marlin", "modelopt", "gptq_marlin_24", "gptq_marlin", | ||
| "awq_marlin", "fbgemm_fp8", "compressed-tensors", "experts_int8", | ||
| "quark", "modelopt_fp4", "bitblas", "gptq_bitblas" | ||
| "quark", "modelopt_fp4", "bitblas", "gptq_bitblas", "inc" | ||
| ] | ||
| if self.quantization is not None: | ||
| self.quantization = cast(QuantizationMethods, self.quantization) | ||
|
|
@@ -1446,7 +1446,7 @@ def get_and_verify_max_len(self, max_model_len: int): | |
|
|
||
|
|
||
| BlockSize = Literal[1, 8, 16, 32, 64, 128] | ||
| CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2"] | ||
| CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc"] | ||
|
Member
Based on the comment here, can we remove the new cache dtype now? #12010 (comment)
Contributor
Since the HPU worker for v1 has been moved to a plugin and v0 will be deprecated soon, we want to make the "fp8_inc to fp8_e4m3" mapping more visible. Alternatively, do you think we can make the mapping function above conditional, like:
Member
It's okay, let's just keep fp8_inc then. |
||
| PrefixCachingHashAlgo = Literal["builtin", "sha256"] | ||
|
|
||
|
|
||
|
|
@@ -1476,7 +1476,7 @@ class CacheConfig: | |
| cache_dtype: CacheDType = "auto" | ||
| """Data type for kv cache storage. If "auto", will use model data type. | ||
| CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports | ||
| fp8 (=fp8_e4m3).""" | ||
| fp8 (=fp8_e4m3). Intel Gaudi (HPU) supports fp8 (using fp8_inc).""" | ||
| is_attention_free: bool = False | ||
| """Whether the model is attention-free. This is primarily set in | ||
| `ModelConfig` and that value should be manually duplicated here.""" | ||
|
|
@@ -1566,7 +1566,7 @@ def _verify_cache_dtype(self) -> None: | |
| "Using fp8 data type to store kv cache. It reduces the GPU " | ||
| "memory footprint and boosts the performance. " | ||
| "Meanwhile, it may cause accuracy drop without a proper " | ||
| "scaling factor") | ||
| "scaling factor.") | ||
| else: | ||
| raise ValueError(f"Unknown kv cache dtype: {self.cache_dtype}") | ||
|
|
||
|
|
@@ -1685,6 +1685,9 @@ class LoadConfig: | |
| default_factory=dict) | ||
| """Extra config for model loader. This will be passed to the model loader | ||
| corresponding to the chosen load_format.""" | ||
| device: Optional[str] = None | ||
| """Device to which model weights will be loaded, default to | ||
| device_config.device""" | ||
| ignore_patterns: Optional[Union[list[str], str]] = None | ||
| """The list of patterns to ignore when loading the model. Default to | ||
| "original/**/*" to avoid repeated loading of llama's checkpoints.""" | ||
|
|
@@ -1792,7 +1795,7 @@ class ParallelConfig: | |
| or equal to the number of GPUs available, "mp" will be used to | ||
| keep processing on a single host. Otherwise, this will default | ||
| to "ray" if Ray is installed and fail otherwise. Note that tpu | ||
| and hpu only support Ray for distributed inference.""" | ||
| only supports Ray for distributed inference.""" | ||
|
|
||
| worker_cls: str = "auto" | ||
| """The full name of the worker class to use. If "auto", the worker class | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -170,6 +170,10 @@ def get_type_hints(type_hint: TypeHint) -> set[TypeHint]: | |
| return type_hints | ||
|
|
||
|
|
||
| def is_online_quantization(quantization: Any) -> bool: | ||
| return quantization in ["inc"] | ||
|
|
||
|
|
||
| @functools.lru_cache(maxsize=30) | ||
| def _compute_kwargs(cls: ConfigType) -> dict[str, Any]: | ||
| cls_docs = get_attr_docs(cls) | ||
|
|
@@ -973,6 +977,8 @@ def create_load_config(self) -> LoadConfig: | |
| return LoadConfig( | ||
| load_format=self.load_format, | ||
| download_dir=self.download_dir, | ||
| device="cpu" | ||
|
Contributor
If weights are first loaded to
Contributor
they are moved by the underlying INC logic, after quantization to fp8 |
||
| if is_online_quantization(self.quantization) else None, | ||
| model_loader_extra_config=self.model_loader_extra_config, | ||
| ignore_patterns=self.ignore_patterns, | ||
| use_tqdm_on_load=self.use_tqdm_on_load, | ||
|
|
@@ -1332,7 +1338,8 @@ def _is_v1_supported_oracle(self, model_config: ModelConfig) -> bool: | |
| and not envs.is_set("VLLM_ATTENTION_BACKEND") | ||
| ) or envs.VLLM_ATTENTION_BACKEND == "FLASH_ATTN_VLLM_V1" | ||
| supported = False | ||
| if current_platform.is_rocm(): | ||
| if (current_platform.is_rocm() or current_platform.device_name | ||
| == "hpu"): # handle hpu also for OOT platform | ||
| supported = True | ||
| elif fp8_attention and will_use_fa: | ||
| from vllm.attention.utils.fa_utils import ( | ||
|
|
||
|
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| from typing import Any, Optional | ||
|
|
||
| import torch | ||
|
|
||
| from vllm.model_executor.layers.fused_moe.layer import ( | ||
| FusedMoE, UnquantizedFusedMoEMethod) | ||
| from vllm.model_executor.layers.linear import (LinearBase, | ||
| UnquantizedLinearMethod) | ||
| from vllm.model_executor.layers.quantization import QuantizationMethods | ||
| from vllm.model_executor.layers.quantization.base_config import ( | ||
| QuantizationConfig, QuantizeMethodBase) | ||
|
|
||
|
|
||
| class INCConfig(QuantizationConfig): | ||
| """Config class for FP8 using Intel Neural Compressor.""" | ||
|
|
||
| @classmethod | ||
| def get_name(cls) -> QuantizationMethods: | ||
| return "inc" | ||
|
|
||
| @classmethod | ||
| def get_supported_act_dtypes(cls) -> list[torch.dtype]: | ||
| return [torch.bfloat16] | ||
|
|
||
| @classmethod | ||
| def from_config(cls, config: dict[str, Any]) -> "INCConfig": | ||
| raise AssertionError | ||
|
|
||
| def get_quant_method(self, layer: torch.nn.Module, | ||
| prefix: str) -> Optional["QuantizeMethodBase"]: | ||
| if isinstance(layer, LinearBase): | ||
| return UnquantizedLinearMethod() | ||
| elif isinstance(layer, FusedMoE): | ||
| return UnquantizedFusedMoEMethod(layer.moe_config) | ||
| return None | ||
|
|
||
| @classmethod | ||
| def get_min_capability(cls) -> int: | ||
| raise AssertionError | ||
|
|
||
| @staticmethod | ||
| def get_config_filenames() -> list[str]: | ||
| return [] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,9 +6,12 @@ | |
| import torch.nn as nn | ||
|
|
||
| from vllm.config import LoadConfig, ModelConfig, VllmConfig | ||
| from vllm.logger import init_logger | ||
| from vllm.model_executor.model_loader.utils import ( | ||
| initialize_model, process_weights_after_loading, set_default_torch_dtype) | ||
|
|
||
| logger = init_logger(__name__) | ||
|
|
||
|
|
||
| class BaseModelLoader(ABC): | ||
| """Base class for model loaders.""" | ||
|
|
@@ -32,11 +35,16 @@ def load_model(self, vllm_config: VllmConfig, | |
| model_config: ModelConfig) -> nn.Module: | ||
| """Load a model with the given configurations.""" | ||
| device_config = vllm_config.device_config | ||
| target_device = torch.device(device_config.device) | ||
| load_config = vllm_config.load_config | ||
| load_device = device_config.device if load_config.device is None else \ | ||
| load_config.device | ||
| target_device = torch.device(load_device) | ||
| with set_default_torch_dtype(model_config.dtype): | ||
| with target_device: | ||
| model = initialize_model(vllm_config=vllm_config, | ||
| model_config=model_config) | ||
|
|
||
| logger.info("Loading weights on %s ...", load_device) | ||
|
||
| # Quantization does not happen in `load_weights` but after it | ||
| self.load_weights(model, model_config) | ||
| process_weights_after_loading(model, model_config, target_device) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -148,8 +148,8 @@ def get_quant_config(model_config: ModelConfig, | |
| quant_cls = get_quantization_config(model_config.quantization) | ||
|
|
||
| # GGUF doesn't have config file | ||
| if model_config.quantization == "gguf": | ||
| return quant_cls.from_config({}) | ||
| if model_config.quantization in ("gguf", "inc"): | ||
| return quant_cls() | ||
|
Comment on lines 154 to +156
Member
Is this valid for gguf?
Contributor
Yes, as gguf also always returns the cls directly. |
||
|
|
||
| # Read the quantization config from the HF model config, if available. | ||
| hf_quant_config = getattr(model_config.hf_config, "quantization_config", | ||
|
|
||

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the doc, it seems like it should be INC.
done