
Conversation

rdentato
Contributor

Currently, the vocabulary is sorted every time the encode() function is called.
For the current version it's not an issue, since the program exits after each run, but if the generation code were called in a loop to build a "chat" version, this repeated qsort would be wasted work (and detrimental to latency).

I've moved the sort into the build_tokenizer() function, since it seems to me this is an operation that logically belongs to the point where the tokenizer is set up.

@karpathy
Owner

Slight downside here is that sorting the vocab probably creates latency, and if the user isn't going to prompt the model we are paying the price needlessly. Maybe we can allocate it lazily?

@rdentato
Contributor Author

Yes, it can be done in encode() to ensure the sort happens only when it is actually needed. I'll change it.
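
A minimal sketch of what the lazy version could look like (the identifiers Tokenizer, TokenIndex, sorted_vocab, and compare_tokens are illustrative here and may not match the actual llama2.c code exactly):

```c
// Hypothetical sketch: sort the vocabulary only on the first call to encode().
// Names below are illustrative, not necessarily the exact llama2.c identifiers.
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *str;  // the token string
    int id;     // its index in the vocabulary
} TokenIndex;

typedef struct {
    char **vocab;             // vocab_size token strings
    int vocab_size;
    TokenIndex *sorted_vocab; // lazily allocated; NULL until the first encode()
} Tokenizer;

static int compare_tokens(const void *a, const void *b) {
    return strcmp(((const TokenIndex *)a)->str, ((const TokenIndex *)b)->str);
}

void encode(Tokenizer *t, const char *text /*, ... output args ... */) {
    if (t->sorted_vocab == NULL) {
        // first call with a prompt: build and sort the lookup table once
        t->sorted_vocab = malloc(t->vocab_size * sizeof(TokenIndex));
        for (int i = 0; i < t->vocab_size; i++) {
            t->sorted_vocab[i].str = t->vocab[i];
            t->sorted_vocab[i].id = i;
        }
        qsort(t->sorted_vocab, t->vocab_size, sizeof(TokenIndex), compare_tokens);
    }
    // ... the rest of encode() can then binary-search t->sorted_vocab ...
    (void)text;
}
```

This way a run with no prompt never pays for the sort, while a chat-style loop pays it only once.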

@rdentato
Contributor Author

Done. Frankly, I don't think many people will use llama2.c without a prompt, but I may be wrong.

@karpathy
Owner

Just pushed a very similar commit, ty.

@karpathy karpathy closed this Aug 22, 2023
@rdentato rdentato deleted the patch-less-qsort-in-encode branch August 22, 2023 07:01