Multiple conversation turns in android app #8218
-
I have a short question. I built and uploaded an Android app deploying Llama 3 (https://bwsyncandshare.kit.edu/s/t3898Ge7AZ6SWBn). I assume that the KV cache is stored in module_ internally (and here). To reuse the last conversation turns within the next prompt, I tried to start here. For performance reasons I don't want to pass the whole conversation history to the model multiple times. Best,
-
@JacobSzwejbka any ideas on how one might do this?
-
I think we need to adjust our runner; the rest of the capabilities already exist. Although how long of a context to keep is a bit of an open question and still a work in progress. KV cache quantization would help here.
-
In the Android app we should always use `generateFromPos()` instead of `generate()`. In fact, we can already do multi-turn with reasonable perf on the Llava model because the app uses `generateFromPos()` for multimodal LLMs (code), but this is not the case for text-only. This should be a relatively simple change.
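For illustration, here is a minimal Kotlin sketch of how the app side could track the KV-cache position across turns and switch between the two calls. The `LlmModuleLike` interface, its parameter lists, the returned token count, and `maxContextLen` are all assumptions made for this sketch; the thread only establishes that `generate()` and `generateFromPos()` exist in the bindings, and their real signatures may differ (see the linked code).

```kotlin
// A minimal sketch, NOT the real ExecuTorch API: LlmModuleLike and its
// signatures are assumptions standing in for the module used by the app.
interface LlmModuleLike {
    // Prefill + decode starting from an empty KV cache.
    // Returns how many tokens (prompt + generated) were consumed this turn.
    fun generate(prompt: String, seqLen: Int, onToken: (String) -> Unit): Long

    // Prefill + decode starting at startPos, reusing the KV cache built so far.
    fun generateFromPos(prompt: String, seqLen: Int, startPos: Long, onToken: (String) -> Unit): Long
}

// Tracks the KV-cache position across turns so that each new turn only
// prefills the newest user message instead of the whole conversation.
class MultiTurnChat(
    private val module: LlmModuleLike,
    private val maxContextLen: Long,   // context length the model was exported with (assumed known)
) {
    private var startPos: Long = 0     // tokens already sitting in the KV cache

    fun sendTurn(formattedPrompt: String, seqLen: Int, onToken: (String) -> Unit) {
        require(startPos < maxContextLen) { "KV cache is full; truncate or reset the history" }

        val consumed = if (startPos == 0L) {
            // First turn: nothing cached yet, start from position 0.
            module.generate(formattedPrompt, seqLen, onToken)
        } else {
            // Later turns: the earlier turns are already in the cache at startPos.
            module.generateFromPos(formattedPrompt, seqLen, startPos, onToken)
        }
        startPos += consumed
    }

    fun reset() {
        startPos = 0  // a real implementation would also need to reset the module's cache
    }
}
```

The key point of the design is that only the newest user message is prefilled on each turn; everything before `startPos` is already represented by the KV cache inside the module, which is what avoids re-feeding the whole history. The caller still has to make sure `startPos` plus the new turn stays under the maximum sequence length the model was exported with, which is the "how long of a context to keep" question mentioned above.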