Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) #12010
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
docs/features/quantization/README.md (outdated):
- [BitBLAS](bitblas.md)
- [GGUF](gguf.md)
- [GPTQModel](gptqmodel.md)
- [Inc](inc.md)
From the doc, it seems like it should be INC.
Suggested change: replace "- [Inc](inc.md)" with "- [INC](inc.md)".
done
BlockSize = Literal[1, 8, 16, 32, 64, 128]
CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2"]              # before
CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc"]   # after
Based on the comment here, can we remove the new cache dtype now? #12010 (comment)
Since the HPU worker for v1 has been moved to a plugin and v0 will be deprecated soon, we want to make the "fp8_inc to fp8_e4m3" mapping more visible.
Alternatively, do you think we can make the mapping above conditional, like:
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8 if not current_platform.is_support_fp8_e4m3() else torch.float8_e4m3fn,
    "fp8_e5m2": torch.uint8,
    "int8": torch.int8,
}
It's okay, let's just keep fp8_inc then
Can you please add a header to this file explaining its purpose? It is rather confusing otherwise and this is a good place to define how this quant method works
model = initialize_model(vllm_config=vllm_config,
                         model_config=model_config)

logger.info("Loading weights on %s ...", load_device)
Make this debug
done
# GGUF doesn't have config file
if model_config.quantization == "gguf":            # before
    return quant_cls.from_config({})
if model_config.quantization in ("gguf", "inc"):   # after
    return quant_cls()
Is this valid for gguf?
Yes, as GGUF also always returns the cls directly:
https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/gguf.py#L51
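For reference, a minimal sketch of the behavior at the linked line; the class below is an illustrative stand-in, not the real vLLM implementation:

```python
from typing import Any


class GGUFConfig:
    """Illustrative stand-in for the linked quantization config class."""

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "GGUFConfig":
        # GGUF checkpoints ship no quantization config file, so the incoming
        # dict is ignored and the config object is constructed directly.
        return cls()
```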
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin, please help review the PR again; we have updated the code and resolved most of your comments.
LGTM, thanks for the iterations. Please resolve the merge conflict
@robertgshaw2-redhat @simon-mo @WoosukKwon, we have received one approval from Michael.
@mgoin, thanks for the review, all CI checks have passed.

This PR adds support for FP8 quantization and inference on Intel Gaudi (HPU) using INC (Intel Neural Compressor).
Currently, quantization is validated only on Llama models.
Measurements are device dependent: do not use measurements collected on Gaudi3 with Gaudi2 accelerators, as this might cause accuracy issues.
Running inference in FP8 with INC:
- Specify the quantization method "inc" and the KV cache dtype "fp8_inc" as parameters to the LLM object.
- Set the environment variable "QUANT_CONFIG" to point to a JSON config file (https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html#supported-json-config-file-options) in QUANTIZE mode. Make sure measurement or scale files exist in the folder specified as "dump_stats_path" in the JSON config file (if no scale files exist, they are generated during the inference run from the measurement files).
- At the end of the run, call the model executor's shutdown method.

More information on vLLM quantization using INC is available in the documentation added in this PR: https://github.com/vllm-project/vllm/blob/main/docs/source/features/quantization/inc.md
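A minimal sketch of the flow described above, assuming an example Llama checkpoint and a placeholder QUANT_CONFIG path; the exact attribute path used to reach the executor's shutdown method is also an assumption:

```python
import os

from vllm import LLM, SamplingParams

# QUANT_CONFIG is normally exported in the shell before launching vLLM;
# the path below is a placeholder for a JSON config file in QUANTIZE mode.
os.environ.setdefault("QUANT_CONFIG", "/path/to/quant_config.json")

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # quantization is currently validated only on Llama models
    quantization="inc",                        # FP8 quantization via INC
    kv_cache_dtype="fp8_inc",                  # FP8 KV cache on Gaudi
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)

# The PR requires calling the model executor's shutdown method at the end of
# the run; this attribute path is one plausible way to reach it and may differ.
llm.llm_engine.model_executor.shutdown()
```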
This PR also adds a new flag, "weights_load_device", which allows loading the model's (unquantized) weights onto a different device than the one the model will run on. If not provided, the device specified in the device config is used.
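A hedged sketch of the new flag in use; whether "weights_load_device" is surfaced as a keyword argument on the LLM constructor (as shown here) or only through the engine arguments is an assumption for illustration:

```python
from vllm import LLM

# Load the (unquantized) weights on CPU first, while the quantized model itself
# runs on the device from the device config (HPU in this scenario).
# The kwarg name follows the flag described above; how it is exposed is assumed.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quantization="inc",
    kv_cache_dtype="fp8_inc",
    weights_load_device="cpu",
)
```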