attempt at chat function #343
Conversation
… Seems to work but it's probably subtly broken or too complex. Version 1 only, lots of hard-coded nonsensical buffer sizes. Have to go to work now.
Once I validate that all the special tokens and the schema are correctly followed, I think it would make sense to hide all that stuff away and make it a pretty chat.
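For reference while validating: this is the layout the Llama 2 chat schema expects for a single exchange, written out as C string constants so it can be compared against what the code builds. The constant names are mine, not from this PR, and the exact whitespace should be double-checked against Meta's reference implementation; <s> and </s> are the tokenizer's BOS/EOS special tokens (ids 1 and 2), not literal text.

/* One exchange, roughly:
 *   <s>[INST] <<SYS>>
 *   {system_prompt}
 *   <</SYS>>
 *
 *   {user_msg_1} [/INST] {model_answer_1} </s><s>[INST] {user_msg_2} [/INST]
 */
static const char *LLAMA2_CHAT_WITH_SYS  = "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]";
static const char *LLAMA2_CHAT_USER_ONLY = "[INST] %s [/INST]";

Hiding the rendering behind constants like these (plus a small snprintf helper) would keep the interactive loop itself free of template details.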
Another thought: it might be possible to re-use generate().
I think it is important to avoid having to reload the model every time.
Is this comment related to this PR?
…tching python. Still some weirdness in the printing to chase down, and also have to tune the buffer lengths and make them sensible.
Yes, but I don't remember exactly what I wanted to say either. Sorry. What kind of changes would be needed to make generate() re-usable? Is there something to clean up or reset between one call and the next?
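My rough understanding (a sketch only, nothing here is this PR's API): the weights and the KV cache buffers can be built once and kept alive for the whole session, and the only per-conversation state is the position counter into the KV cache. "Resetting" between conversations is just setting that counter back to zero; between consecutive calls within the same conversation nothing needs to be reset at all.

// Hypothetical sketch of the state that distinguishes one conversation from the next.
// The loaded transformer/tokenizer/sampler stay in memory; entries past `pos` in the
// KV cache are simply overwritten as new tokens are forwarded, so no clearing is needed.
typedef struct {
    int pos;   // next position to write in the KV cache
} ChatState;

// Start (or restart) a conversation without touching the model weights.
void chat_reset(ChatState *state) {
    state->pos = 0;
}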
Anyway, my biggest concern with this approach is that it has the template used by llama_7_b_chat embedded in the code.
What other templates are there, though? Isn't this the template for Llama 2 chat?
What if someone trains another chat model using its own template? Not necessarily starting from one of the Meta Llama models. It's just my personal opinion, of course, but placing "chat" inside "run" seems to mix two different things.
If they use the exact same format as Llama 2, or finetune a Llama 2 Chat model, then the code is plug and play.
OK, I see your point. I still believe it should be made more general, but that's something that can wait until someone wants to introduce a new template.
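If and when that generalization is wanted, one low-effort shape (a sketch only; none of these names exist in run.c) would be to keep the Llama 2 markers as defaults in a small struct and let a future chat model supply different ones, so the chat loop itself never mentions [INST]:

// Sketch: template pieces as plain strings, defaulting to the Llama 2 chat markers.
typedef struct {
    const char *user_prefix;   // emitted before each user message
    const char *user_suffix;   // emitted after each user message
    const char *sys_prefix;    // wraps the optional system prompt
    const char *sys_suffix;
} ChatTemplate;

static const ChatTemplate LLAMA2_TEMPLATE = {
    "[INST] ", " [/INST]",
    "<<SYS>>\n", "\n<</SYS>>\n\n"
};

A hypothetical command-line flag could then select a different set of markers without touching the loop.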
if (token == 2) { user_turn = 1; }   // token 2 is EOS in the Llama 2 vocab: the model has finished its reply, hand the turn back to the user

// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
Just for my knowledge, aren't there any problems with not forwarding the model through the tokens of the user's prompt (e.g. things like the KV cache and position/relative offsets)?
For example, in "generate" I see we forward the model through the prompt, and sampling from the logits happens only after "forward":
if (pos < num_prompt_tokens - 1) {
// if we are still processing the input prompt, force the next prompt token
next = prompt_tokens[pos + 1];
} else {
// otherwise sample the next token from the logits
next = sample(sampler, logits);
}
But in chat mode here we encode the whole new user prompt (without advancing "pos" through it), then "forward" once and immediately "sample" for that token:
// ... add whole new user prompt
// forward the transformer to get logits for the next token
float* logits = forward(transformer, token, pos);
next = sample(sampler, logits);
pos++;
Shouldn't we "forward" through the new prompt first, and only then sample from the logits?
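For what it's worth, the shape I'd expect (a drop-in sketch against the variables already shown above; user_idx is an illustrative name, not something in this PR) is to feed every token of the freshly encoded user prompt through forward, advancing pos so the KV cache gets filled, and to sample only from the logits produced by the last prompt token:

// Consume the whole user prompt before sampling the first token of the reply.
int user_idx = 0;                        // next prompt token to feed in
while (user_idx < num_prompt_tokens) {
    int token = prompt_tokens[user_idx++];
    float* logits = forward(transformer, token, pos);
    pos++;
    if (user_idx == num_prompt_tokens) {
        // the prompt is now fully in the KV cache; these logits predict the reply's first token
        next = sample(sampler, logits);
    }
}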
Add an interactive loop to enable a nice chat with a Llama 2 Chat model.
Just uploading partial work on the chat function, version 1.
Not sure I'm super happy with the layout of it just yet.
Weird hard-coded buffer sizes.
Seems not to be obviously broken, though; e.g. on my machine: