Multiple conversation turns in android app #8218
-
I have a short question. I built and uploaded an Android app deploying Llama 3 (https://bwsyncandshare.kit.edu/s/t3898Ge7AZ6SWBn). I assume that the KV cache is stored in module_ internally (and here). To reuse the last conversation turns within the next prompt, I tried to start here. For performance reasons I don't want to pass the whole conversation history to the model multiple times. Best,
-
@JacobSzwejbka any ideas on how one might do this?
-
I think we need to adjust our runner; the rest of the capabilities already exist. Although how long of a context to keep is a bit of an open question and still a work in progress. KV cache quantization would help here.
-
In the Android app we should always use `generateFromPos()` instead of `generate()`. In fact, we can already do multi-turn with reasonable perf on the Llava model because the app uses `generateFromPos()` for multimodal LLMs (code), but this is not the case for text-only. This should be a relatively simple change.
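For illustration, here is a minimal Kotlin sketch of how the app side could track the KV-cache position across turns and switch between the two calls. The `LlmModuleLike` interface, its parameter lists, the returned token count, and `maxContextLen` are all assumptions made for this sketch; the thread only establishes that `generate()` and `generateFromPos()` exist in the bindings, and their real signatures may differ (see the linked code).

```kotlin
// A minimal sketch, NOT the real ExecuTorch API: LlmModuleLike and its
// signatures are assumptions standing in for the module used by the app.
interface LlmModuleLike {
    // Prefill + decode starting from an empty KV cache.
    // Returns how many tokens (prompt + generated) were consumed this turn.
    fun generate(prompt: String, seqLen: Int, onToken: (String) -> Unit): Long

    // Prefill + decode starting at startPos, reusing the KV cache built so far.
    fun generateFromPos(prompt: String, seqLen: Int, startPos: Long, onToken: (String) -> Unit): Long
}

// Tracks the KV-cache position across turns so that each new turn only
// prefills the newest user message instead of the whole conversation.
class MultiTurnChat(
    private val module: LlmModuleLike,
    private val maxContextLen: Long,   // context length the model was exported with (assumed known)
) {
    private var startPos: Long = 0     // tokens already sitting in the KV cache

    fun sendTurn(formattedPrompt: String, seqLen: Int, onToken: (String) -> Unit) {
        require(startPos < maxContextLen) { "KV cache is full; truncate or reset the history" }

        val consumed = if (startPos == 0L) {
            // First turn: nothing cached yet, start from position 0.
            module.generate(formattedPrompt, seqLen, onToken)
        } else {
            // Later turns: the earlier turns are already in the cache at startPos.
            module.generateFromPos(formattedPrompt, seqLen, startPos, onToken)
        }
        startPos += consumed
    }

    fun reset() {
        startPos = 0  // a real implementation would also need to reset the module's cache
    }
}
```

The key point of the design is that only the newest user message is prefilled on each turn; everything before `startPos` is already represented by the KV cache inside the module, which is what avoids re-feeding the whole history. The caller still has to make sure `startPos` plus the new turn stays under the maximum sequence length the model was exported with, which is the "how long of a context to keep" question mentioned above.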