Cache the model’s state to pre-eval large responses #11022
Unanswered · CodeDruidX asked this question in Q&A · Replies: 1 comment
Is it possible to partially cache the model's response in the same way as `--prompt-cache`? The issue is that I need to re-generate very large responses to the same short prompt, varying only the seed, in order to produce just the very last token of the response differently each time. I understand that this task would be better handled by a classic GPT-style architecture without internal state, but I would like to implement something similar with llama.cpp. It seems to me that for this I need to somehow learn how to save the internal state of the model for later reuse.
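Roughly, I am imagining something like the untested sketch below, assuming the `llama_state_get_size` / `llama_state_get_data` / `llama_state_set_data` functions in `llama.h` are the right tool for this: evaluate the shared prompt plus the fixed part of the response once, snapshot the context state into a buffer, and then restore that snapshot before each re-generation instead of re-evaluating everything.

```cpp
// Untested sketch of what I have in mind, assuming the llama_state_* API from
// llama.h: snapshot the whole serialized context state (KV cache, etc.) after
// the shared prefix has been decoded, and restore it before every re-generation.

#include "llama.h"

#include <cstdint>
#include <vector>

// Copy the full serialized context state into an in-memory buffer.
static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_state_get_size(ctx));
    llama_state_get_data(ctx, buf.data(), buf.size());
    return buf;
}

// Restore a previously captured snapshot into the same context, so that the
// next llama_decode() continues as if the shared prefix had just been evaluated.
static void restore_state(llama_context * ctx, const std::vector<uint8_t> & buf) {
    llama_state_set_data(ctx, buf.data(), buf.size());
}
```

The idea would then be: decode the prompt and the fixed part of the response once, call `snapshot_state()`, and in a loop call `restore_state()`, re-seed the sampler, and sample only the final token. Or is the file-based `llama_state_save_file` / `llama_state_load_file` pair (which, as far as I can tell, is what `--prompt-cache` builds on) the intended way to do this?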