Is your feature request related to a problem? Please describe.
This feature request proposes letting users run larger context lengths by shrinking KV cache memory usage through quantization.
Describe the solution you'd like
First we will demonstrate KV cache quantization using HF's QuantizedCache,
and then we will expand the code to work with vLLM. The current solution dequantizes the cache on every step, concatenates it with the current key/value states, and then quantizes again with a new scale/zero-point.
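For reference, a minimal end-to-end demonstration with HF's QuantizedCache might look like the sketch below. It assumes a recent transformers version that supports `cache_implementation="quantized"` with the quanto backend (optimum-quanto installed); the model ID is just an illustrative placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any causal LM supported by transformers would work.
model_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tok("A very long prompt goes here ...", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",                # store the KV cache in a QuantizedCache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit quantization via the quanto backend
)
print(tok.decode(out[0], skip_special_tokens=True))
```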
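And as a rough illustration of the dequantize-concatenate-requantize step described above, here is a standalone PyTorch sketch. The function names (`quantize`, `dequantize`, `update_cache`) are hypothetical, not from an existing implementation; it uses per-tensor affine quantization purely to keep the example small.

```python
import torch

def quantize(x: torch.Tensor, nbits: int = 8):
    """Per-tensor affine quantization with a freshly computed scale/zero-point."""
    qmin, qmax = 0, 2 ** nbits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - x.min() / scale).round().clamp(qmin, qmax)
    q = (x / scale + zero_point).round().clamp(qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    return (q.float() - zero_point) * scale

def update_cache(q_cache, scale, zero_point, new_states):
    """Dequantize the stored cache, append the new key/value states along the
    sequence dimension, then re-quantize everything with a new scale/zero-point."""
    full = torch.cat(
        [dequantize(q_cache, scale, zero_point).to(new_states.dtype), new_states],
        dim=-2,  # cache layout assumed to be [batch, heads, seq_len, head_dim]
    )
    return quantize(full)
```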
Describe alternatives you've considered
There are several alternative implementations, but these will be explored in the future as we implement different algorithms.