
attempt at chat function #343

Merged
merged 5 commits into from Aug 25, 2023

Conversation

karpathy
Owner

Just uploading partial work on the chat function, version 1.
Not sure I'm super happy with the layout of it just yet.
Weird hard-coded buffer sizes.
It doesn't seem to be obviously broken though, e.g. on my machine:

ubuntu:~/llamac$ make runomp
ubuntu:~/llamac$ OMP_NUM_THREADS=16 ./run llama2_7b_chat.bin -m chat -y "You only respond with just the first single word that comes to mind" -i "America"

<s>
[INST] <<SYS>>
You only respond with just the first single word that comes to mind
<</SYS>>

America
[/INST]  Flag
</s>
Enter user prompt: Canada

<s>
[INST] Canada [/INST]  Maple
</s>
Enter user prompt: France

<s>
[INST] France [/INST]  Eiffel
</s>
Enter user prompt: Italy

<s>
[INST] Italy [/INST]  Pizza
</s>
Enter user prompt: Iceland

<s>
[INST] Iceland [/INST]  Glacier
</s>
Enter user prompt: Middle Earth

<s>
[INST] Middle Earth [/INST]  Hobbit
</s>
Enter user prompt: 

…. Seems to work but it's probably subtly broken or too complex. Version 1 only, lots of hard-coded nonsensical buffer sizes. Have to go to work now
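
For readers following along, here is a minimal sketch of how the Llama 2 chat schema visible in the transcript above can be rendered into a prompt string. This is not the PR's code; the function name, signature, and buffer handling are illustrative only. (The <s>/</s> markers in the transcript are the tokenizer's BOS/EOS tokens and are normally added as token ids rather than text.)

#include <stdio.h>

/* Illustrative only, not the PR's implementation. Renders one user turn in the
 * Llama 2 chat schema shown in the transcript above. The system prompt is
 * wrapped into the first [INST] block only; later turns are plain
 * "[INST] ... [/INST]". */
void render_turn(char *out, size_t out_size,
                 const char *system_prompt, const char *user_prompt,
                 int is_first_turn) {
    if (is_first_turn && system_prompt && system_prompt[0] != '\0') {
        snprintf(out, out_size, "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]",
                 system_prompt, user_prompt);
    } else {
        snprintf(out, out_size, "[INST] %s [/INST]", user_prompt);
    }
}
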
@karpathy
Owner Author

Once I validate that all the special tokens and the schema are correctly followed, I think it would make sense to hide all that stuff away and make it a pretty chat.

@karpathy
Owner Author

Another thought: it might be possible to re-use the generate function for chat, but it would require quite a few changes inside it; not sure if that's the better path. Leaning to no at the moment.

@rdentato
Contributor

I think it is important to avoid having to reload the model every time.
With CUDA (and, in general, when using a GPU) it will be a slow process: it takes almost 2 minutes on my T4 to load the fp16 llama2-7b-chat model.

@karpathy
Owner Author

I think it is important to avoid having to reload the model every time.
With CUDA (and, in general, when using a GPU) it will be a slow process: it takes almost 2 minutes on my T4 to load the fp16 llama2-7b-chat model.

Is this comment related to this PR?

…tching python. still some weirdness in the printing to chase down, and also have to tune the buffer lengths and make them sensible and such
@karpathy karpathy marked this pull request as ready for review August 25, 2023 14:58
@karpathy karpathy merged commit 49daf18 into master Aug 25, 2023
6 checks passed
@rdentato
Contributor

rdentato commented Aug 25, 2023

Yes, but I don't remember exactly what I wanted to say either. Sorry.

What kind of changes would need to be made to "generate()" to make it re-usable? Is there something to clean up or reset between one call and the next?
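
For what it's worth, here is a hedged sketch of what "reusable" could look like; the name generate_from and its signature are invented, not the PR's API. The main thing a chat loop needs is for the position, and with it the KV cache built up during earlier turns, to survive across calls instead of restarting at 0; as far as I can tell there is nothing that must be cleaned up between calls, since cache entries are simply overwritten as pos advances. The sketch relies on run.c's forward() and sample() as quoted later in this thread.

/* Hypothetical sketch, not the PR's API: a generate-style loop that a chat
 * driver could call once per turn, passing in the position where the previous
 * turn ended so the KV cache is reused rather than rebuilt. */
int generate_from(Transformer *transformer, Sampler *sampler,
                  int *prompt_tokens, int num_prompt_tokens,
                  int start_pos, int steps) {
    if (num_prompt_tokens < 1) return start_pos;
    int pos = start_pos;
    int token = prompt_tokens[0];
    int next = token;
    while (pos < start_pos + steps) {
        float *logits = forward(transformer, token, pos);
        if (pos - start_pos < num_prompt_tokens - 1) {
            next = prompt_tokens[pos - start_pos + 1]; // still feeding this turn's prompt
        } else {
            next = sample(sampler, logits);            // now generating the reply
        }
        pos++;
        if (next == 2) break; // EOS (</s>) ends the assistant's reply
        token = next;
    }
    return pos; // the caller passes this back in as start_pos for the next turn
}
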

@rdentato
Contributor

Anyway, my biggest concern with this approach is that it has the template used by llama2-7b-chat embedded in the code.
I don't think it should be merged into master until there's a way to choose the template,
unless you consider this version temporary and plan to add the ability to use different templates in a later version.

@karpathy
Owner Author

What other templates are there though? Isn't this the template for Llama 2 chat?

@rdentato
Contributor

rdentato commented Aug 25, 2023

What if someone trains another chat model using its own template, not necessarily starting from one of the Meta Llama models?
Isn't the intent of llama2.c to let people train (or fine-tune) their own models and then run inference on them?

It's just my personal opinion, of course, but placing "chat" into "run" seems to mix two different things.
Personally, I would rather have seen the core abstracted as a library and then two mains: one for "run" and one for "chat".
Again, that's just what I would have expected to see.
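
To make that suggestion concrete, here is one purely hypothetical shape for a pluggable template, described as data rather than hard-coded strings. Nothing like this exists in the PR; the format strings are simply the Llama 2 ones visible in the transcript above.

/* Purely illustrative: a chat template as data, so a different model's schema
 * could be swapped in without touching the chat loop itself. */
typedef struct {
    const char *system_user_fmt; // first turn: system prompt + user prompt
    const char *user_fmt;        // later turns: user prompt only
} ChatTemplate;

static const ChatTemplate LLAMA2_CHAT = {
    "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]", // system_user_fmt
    "[INST] %s [/INST]",                          // user_fmt
};
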

@karpathy
Owner Author

If they use the exact same format as Llama 2, or finetune a Llama 2 chat model, then the code is plug and play.
If they wish to innovate on their format, then they have to develop their own chat code.

@rdentato
Contributor

OK, I see your point. I still believe it should be made more general, but that's something that can wait until someone wants to introduce a new template.

if (token == 2) { user_turn = 1; } // token 2 is the Llama 2 EOS (</s>): the assistant's reply is over, hand control back to the user

// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
Contributor

@nikolaydubina nikolaydubina Aug 26, 2023

Just for my knowledge: aren't there any problems with not forwarding the model through the tokens of the user's prompt (e.g. things like the KV cache and position/relative offsets)?

For example, for "generate" I see we forward the model through the prompt; sampling from the logits happens after "forward":

        if (pos < num_prompt_tokens - 1) {
            // if we are still processing the input prompt, force the next prompt token
            next = prompt_tokens[pos + 1];
        } else {
            // otherwise sample the next token from the logits
            next = sample(sampler, logits);
        }

But for chat mode here we encode the whole new user prompt ("pos" is not incremented through it), then "forward" and immediately "sample" for that token:

        // ... add whole new user prompt

        // forward the transformer to get logits for the next token
        float* logits = forward(transformer, token, pos);
        next = sample(sampler, logits);
        pos++;

Shouldn't we "forward" through the new prompt first and only then sample from the logits?
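
For reference, here is a sketch of the pattern the question is pointing at (illustrative only, not the PR's code or its eventual fix), reusing the variable names from the excerpts above: forward every token of the freshly encoded user prompt so the KV cache is populated, and only sample once the last prompt token has been processed.

// Illustrative only: process the whole new user prompt before sampling,
// mirroring the generate() pattern quoted above. prompt_tokens holds the newly
// encoded user turn; pos continues from where the previous turn ended.
for (int i = 0; i < num_prompt_tokens; i++) {
    float *logits = forward(transformer, prompt_tokens[i], pos);
    pos++;
    if (i == num_prompt_tokens - 1) {
        next = sample(sampler, logits); // first token of the assistant's reply
    }
}
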

vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request Jan 20, 2024
Add interactive loop to enable nice chat with a Llama 2 Chat model