attempt at chat function #343
Conversation
… Seems to work but it's probably subtly broken or too complex. Version 1 only, lots of hard-coded nonsensical buffer sizes. Have to go to work now.
Once I validate that all the special tokens and the schema are correctly followed, I think it would make sense to hide all that stuff away and make it a pretty chat.
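For reference while validating: this is the layout the Llama 2 chat schema expects for a single exchange, written out as C string constants so it can be compared against what the code builds. The constant names are mine, not from this PR, and the exact whitespace should be double-checked against Meta's reference implementation; <s> and </s> are the tokenizer's BOS/EOS special tokens (ids 1 and 2), not literal text.

/* One exchange, roughly:
 *   <s>[INST] <<SYS>>
 *   {system_prompt}
 *   <</SYS>>
 *
 *   {user_msg_1} [/INST] {model_answer_1} </s><s>[INST] {user_msg_2} [/INST]
 */
static const char *LLAMA2_CHAT_WITH_SYS  = "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]";
static const char *LLAMA2_CHAT_USER_ONLY = "[INST] %s [/INST]";

Hiding the rendering behind constants like these (plus a small snprintf helper) would keep the interactive loop itself free of template details.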
Another thought: it might be possible to re-use generate().
I think it is important to avoid having to reload the model every time.
Is this comment related to this PR?
…tching python. Still some weirdness in the printing to chase down, and also have to tune the buffer lengths and make them sensible.
Yes, but I don't remember exactly what I wanted to say either. Sorry. What kind of changes would be needed to make generate() re-usable? Is there something to clean up or reset between one call and the next?
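My rough understanding (a sketch only, nothing here is this PR's API): the weights and the KV cache buffers can be built once and kept alive for the whole session, and the only per-conversation state is the position counter into the KV cache. "Resetting" between conversations is just setting that counter back to zero; between consecutive calls within the same conversation nothing needs to be reset at all.

// Hypothetical sketch of the state that distinguishes one conversation from the next.
// The loaded transformer/tokenizer/sampler stay in memory; entries past `pos` in the
// KV cache are simply overwritten as new tokens are forwarded, so no clearing is needed.
typedef struct {
    int pos;   // next position to write in the KV cache
} ChatState;

// Start (or restart) a conversation without touching the model weights.
void chat_reset(ChatState *state) {
    state->pos = 0;
}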
Anyway, my biggest concern with this approach is that it has the template used by llama_7_b_chat embedded in the code.
What other templates are there, though? Isn't this the template for Llama 2 chat?
What if someone trains another chat model using its own template? Not necessarily starting from one of the Meta Llama models. It's just my personal opinion, of course, but placing "chat" inside "run" seems to mix two different things.
If they use the exact same format as Llama 2, or finetune a Llama 2 Chat model, then the code is plug and play.
OK, I see your point. I still believe it should be made more general, but that's something that can wait until someone wants to introduce a new template.
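If and when that generalization is wanted, one low-effort shape (a sketch only; none of these names exist in run.c) would be to keep the Llama 2 markers as defaults in a small struct and let a future chat model supply different ones, so the chat loop itself never mentions [INST]:

// Sketch: template pieces as plain strings, defaulting to the Llama 2 chat markers.
typedef struct {
    const char *user_prefix;   // emitted before each user message
    const char *user_suffix;   // emitted after each user message
    const char *sys_prefix;    // wraps the optional system prompt
    const char *sys_suffix;
} ChatTemplate;

static const ChatTemplate LLAMA2_TEMPLATE = {
    "[INST] ", " [/INST]",
    "<<SYS>>\n", "\n<</SYS>>\n\n"
};

A hypothetical command-line flag could then select a different set of markers without touching the loop.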
if (token == 2) { user_turn = 1; }   // token 2 is EOS in the Llama 2 vocab: the model has finished its reply, hand the turn back to the user

// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
Just for my knowledge, aren't there any problems with not forwarding the model through the tokens of the user's prompt (e.g. things like the KV cache and position/relative offsets)?
For example, in "generate" I see we forward the model through the prompt, and sampling from the logits happens only after "forward":
if (pos < num_prompt_tokens - 1) {
// if we are still processing the input prompt, force the next prompt token
next = prompt_tokens[pos + 1];
} else {
// otherwise sample the next token from the logits
next = sample(sampler, logits);
}
But in chat mode here we encode the whole new user prompt (without advancing "pos" through it), then "forward" once and immediately "sample" for that token:
// ... add whole new user prompt
// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
next = sample(sampler, logits);
pos++;
Shouldn't we "forward" through the new prompt first, and only then sample from the logits?
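For what it's worth, the shape I'd expect (a drop-in sketch against the variables already shown above; user_idx is an illustrative name, not something in this PR) is to feed every token of the freshly encoded user prompt through forward, advancing pos so the KV cache gets filled, and to sample only from the logits produced by the last prompt token:

// Consume the whole user prompt before sampling the first token of the reply.
int user_idx = 0;                        // next prompt token to feed in
while (user_idx < num_prompt_tokens) {
    int token = prompt_tokens[user_idx++];
    float* logits = forward(transformer, token, pos);
    pos++;
    if (user_idx == num_prompt_tokens) {
        // the prompt is now fully in the KV cache; these logits predict the reply's first token
        next = sample(sampler, logits);
    }
}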
Add an interactive loop to enable a nice chat with a Llama 2 Chat model.
Just uploading partial work on the chat function, version 1.
Not sure I'm super happy with the layout of it just yet.
Weird hard-coded buffer sizes.
Seems not to be obviously broken, though; e.g. on my machine: