Is your feature request related to a problem? Please describe.
This feature request proposes letting users run larger context lengths by shrinking KV cache memory usage through quantization.
Describe the solution you'd like
First we will demonstrate KV cache quantization using HF's QuantizedCache,
and then we will expand the code to work with vLLM. The current solution dequantizes the cache on every step, concatenates it with the current key/value states, and then quantizes again with a new scale/zero-point.
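For reference, a minimal end-to-end demonstration with HF's QuantizedCache might look like the sketch below. It assumes a recent transformers version that supports `cache_implementation="quantized"` with the quanto backend (optimum-quanto installed); the model ID is just an illustrative placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any causal LM supported by transformers would work.
model_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tok("A very long prompt goes here ...", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",                # store the KV cache in a QuantizedCache
    cache_config={"backend": "quanto", "nbits": 4},  # 4-bit quantization via the quanto backend
)
print(tok.decode(out[0], skip_special_tokens=True))
```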
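And as a rough illustration of the dequantize-concatenate-requantize step described above, here is a standalone PyTorch sketch. The function names (`quantize`, `dequantize`, `update_cache`) are hypothetical, not from an existing implementation; it uses per-tensor affine quantization purely to keep the example small.

```python
import torch

def quantize(x: torch.Tensor, nbits: int = 8):
    """Per-tensor affine quantization with a freshly computed scale/zero-point."""
    qmin, qmax = 0, 2 ** nbits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = (qmin - x.min() / scale).round().clamp(qmin, qmax)
    q = (x / scale + zero_point).round().clamp(qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    return (q.float() - zero_point) * scale

def update_cache(q_cache, scale, zero_point, new_states):
    """Dequantize the stored cache, append the new key/value states along the
    sequence dimension, then re-quantize everything with a new scale/zero-point."""
    full = torch.cat(
        [dequantize(q_cache, scale, zero_point).to(new_states.dtype), new_states],
        dim=-2,  # cache layout assumed to be [batch, heads, seq_len, head_dim]
    )
    return quantize(full)
```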
Describe alternatives you've considered
There are several alternative implementations, but these will be explored in the future as we implement different algorithms.