
[Issue] Calculate Tokens size? #15

Open
rk-teche opened this issue Feb 11, 2023 · 6 comments
Comments
@rk-teche

The token count is not accurate if we compare it with the GPT-3 tokenizer.

Any help would be appreciated.
Thanks

@rk-teche rk-teche changed the title [Issue] How can we get the calculate Tokens? [Issue] Calculate Tokens size? Feb 11, 2023
@evilDave

evilDave commented Mar 3, 2023

Do you have an example that (still) does not work? The token count is identical for every text I have checked.

@lhr0909
Contributor

lhr0909 commented Mar 13, 2023

@rk-teche thank you for your feedback! There could be a discrepancy with the current OpenAI models, especially when compared with the token counts the API reports. I am going to spend some time moving token calculation over to OpenAI's own tiktoken inside the package, as part of the v2 work.

@Aldo111

Aldo111 commented Apr 11, 2023

Hi, I found one issue where this package doesn't seem to count newlines properly, while the GPT tokenizer adds 2 tokens per newline.

E.g. for "Hello\n\n" this package returns 2 tokens, but the online GPT tokenizer returns 5. Does the package trim the text or something?

[screenshot: the online GPT tokenizer counting "Hello\n\n" as 5 tokens]

@kitfit-dave

I think what you will find is that the online tokenizer does not recognise \n as a newline (it sees the two characters \ and n). Just put in two hard newlines and you will get 2 tokens. Also, look at the token IDs for your entered string: [15496, 59, 77, 59, 77], where 59 is \ and 77 is n. Alternatively, test gpt3-tokenizer with the string 'Hello\\n\\n' and it will come out as 5.
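The distinction here is plain JavaScript string escaping, which a quick sketch (no tokenizer needed) makes concrete:

```javascript
// In a JavaScript string literal, \n is one newline character,
// while \\n is two characters: a backslash followed by the letter n.
const literalNewlines = 'Hello\n\n';      // "Hello" + two real newlines
const escapedBackslashes = 'Hello\\n\\n'; // "Hello" + the text "\n\n"

console.log(literalNewlines.length);    // 7: 5 letters + 2 newline characters
console.log(escapedBackslashes.length); // 9: 5 letters + 4 visible characters

// The online tokenizer treats whatever you type as literal text, so typing
// "\n" into it submits the backslash-n pair, matching the IDs [59, 77] above.
console.log([...escapedBackslashes].slice(5)); // [ '\\', 'n', '\\', 'n' ]
```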

@Aldo111

Aldo111 commented Apr 12, 2023

Yep, I'm aware of \ + n being counted separately, since the tokenizer screenshot above shows it clearly. Given your last example, would the most appropriate approach then be to escape the string (or its special characters) before passing it to the tokenizer?

Alternatively, what I've ended up doing is treating the tokenizer output as an estimate rather than an exact count (which generally makes sense anyway, given the documentation and long-term model differences) and following the "Counting Tokens" deep-dive guide (for GPT-3.5+) in the OpenAI docs. Combining gpt3-tokenizer with the estimates provided in that guide is super helpful and brings the results closer to accuracy.

@kitfit-dave

For passing to the tokeniser, you should escape in the regular JavaScript way, so "Hello" followed by two newlines is "Hello\n\n". Is that not giving you 2 tokens? Or are you saying you think the answer should be 5? The online tokeniser your screenshot is from does not accept escaped characters, only literal characters; if you want a newline there, you should type a newline. Only then are you comparing apples to apples.

I've been commenting on these issues where folks say "it's an estimate" or "it's not correct" because I switched to this library precisely because it seems to be exactly correct. I feel a lot of work has been done in this project to make it so, and I'd like everyone to benefit from that, knowing the results are accurate.
