Description
Context
A “stateful model” is a model that implicitly preserves data between two consecutive inference calls, such as the KV cache for LLMs. Using a stateful model for inference minimizes the overhead of processing the KV cache and, together with additional optimizations, significantly speeds up inference. OpenVINO currently exports LLMs from PyTorch to OpenVINO IR as stateful models by default, so NNCF should demonstrate this default flow in its examples as well.
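The difference can be sketched with a toy, non-OpenVINO example (class names and the dummy decoding logic below are purely illustrative): a stateless decoder forces the caller to pass the entire, ever-growing KV cache across the inference boundary on every call, while a stateful decoder keeps the cache inside the model so only the new token crosses the boundary.

```python
class StatelessDecoder:
    """Stateless: the caller owns the KV cache and must hand it back each call."""

    def step(self, token, kv_cache):
        # The whole cache is transferred in and out every call;
        # this overhead grows with sequence length.
        kv_cache = kv_cache + [token]
        return sum(kv_cache) % 7, kv_cache  # dummy "logits" for illustration


class StatefulDecoder:
    """Stateful: the KV cache is implicitly preserved between inference calls."""

    def __init__(self):
        self._kv_cache = []  # cache lives inside the model

    def step(self, token):
        # Only the new token crosses the model boundary.
        self._kv_cache.append(token)
        return sum(self._kv_cache) % 7
```

Both decoders produce identical outputs for the same token sequence; the stateful variant simply avoids shuttling the cache back and forth, which is the overhead a stateful OpenVINO model eliminates.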
What needs to be done?
Update the following LLM compression examples to use a stateful model:
- Large Language Models FP8 Compression Example
- Find the appropriate hyperparameters to compress the TinyLLama model
- Compress TinyLLama model using synthetic data
Example Pull Requests
Resources
- Contribution guide - start here!
- Intel DevHub Discord channel - engage in discussions, ask questions and talk to OpenVINO developers
- How to link your Pull Request to an issue
Contact points
Ticket