[Good First Issue][NNCF]: Use stateful model in LLM compression examples #3491

Open
@alexsu52

Description

Context

A “stateful model” is a model that implicitly preserves data between two consecutive inference calls, such as the KV cache for LLMs (more details). Using a stateful model at inference time minimizes the overhead of processing the KV cache and, together with additional optimizations, significantly speeds up model inference. OpenVINO currently exports LLMs from PyTorch to OpenVINO IR as stateful models by default, so NNCF should also demonstrate this default flow in its examples.
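
For illustration, a minimal sketch of the intended flow, assuming the `stateful` export flag of optimum-intel's `OVModelForCausalLM`; the model ID and compression parameters below are placeholders, not the settings of any specific example:

```python
import nncf
import openvino as ov
from optimum.intel.openvino import OVModelForCausalLM

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative model, not from this issue

# Export from PyTorch to OpenVINO IR. `stateful=True` keeps the KV cache
# inside the model as internal state (this is also the default behavior).
model = OVModelForCausalLM.from_pretrained(
    MODEL_ID,
    export=True,
    stateful=True,
    load_in_8bit=False,  # keep full-precision weights so NNCF does the compression
)

# Compress the weights of the underlying ov.Model with NNCF.
# Mode, ratio, and group size here are illustrative values.
compressed = nncf.compress_weights(
    model.model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    group_size=128,
)

# Save the compressed stateful IR.
ov.save_model(compressed, "model_int4.xml")
```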

What needs to be done?

Update the following LLM compression examples to use a stateful model:

Example Pull Requests

#3490

Resources

Contact points

@ljaljushkin
@alexsu52

Ticket

No response
