
Quantized KV cache #107

Open
@tharapalanivel

Description


Is your feature request related to a problem? Please describe.

This feature request proposes letting users run larger context lengths by shrinking KV cache memory usage through quantization.

Describe the solution you'd like

First we will demonstrate KV cache quantization using HF's QuantizedCache, and then we will expand the code to work with vLLM. The current solution dequantizes the cache on every step, concatenates it with the current state, and then quantizes again with a new scale/zero-point.
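To illustrate the cycle described above, here is a minimal NumPy sketch of the dequantize → concatenate → requantize loop. The function names and the simple asymmetric per-tensor uint8 scheme are illustrative assumptions, not the actual QuantizedCache implementation:

```python
import numpy as np

def quantize(x, nbits=8):
    """Asymmetric per-tensor quantization: int tensor plus scale/zero-point."""
    qmin, qmax = 0, 2 ** nbits - 1
    scale = max(float(x.max() - x.min()) / (qmax - qmin), 1e-8)
    zp = round(-float(x.min()) / scale)  # zero-point maps x.min() near qmin
    q = np.clip(np.round(x / scale) + zp, qmin, qmax).astype(np.uint8)
    return q, scale, zp

def dequantize(q, scale, zp):
    return (q.astype(np.float32) - zp) * scale

def append_to_quantized_cache(q_cache, scale, zp, new_states):
    """The cycle from the proposal: dequantize the stored cache, concatenate
    the new key/value states along the sequence axis, then requantize the
    whole thing with a fresh scale/zero-point."""
    full = np.concatenate([dequantize(q_cache, scale, zp), new_states], axis=0)
    return quantize(full)

# Hypothetical usage: an 8-token cache grows by 2 tokens per step.
rng = np.random.default_rng(0)
cache = rng.normal(size=(8, 64)).astype(np.float32)
q, s, z = quantize(cache)
q, s, z = append_to_quantized_cache(q, s, z, rng.normal(size=(2, 64)).astype(np.float32))
```

Note the cost this implies: every decoding step pays a full dequantize/requantize pass over the cache, which is the overhead later algorithm variants would aim to avoid.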

Describe alternatives you've considered

There are several alternative implementations, but these will be explored in the future as we implement different algorithms.

