Encoding issue with "Hello," #558


Closed
cahya-wirawan opened this issue Mar 5, 2025 · 13 comments · Fixed by #561

Comments

@cahya-wirawan

cahya-wirawan commented Mar 5, 2025

Hi,
I use your BPETokenizerSimple class from the bonus contents for encoding. There seems to be an issue encoding the simple text "Hello,":

>>> import os
>>> gpt2 = BPETokenizerSimple()
>>> gpt2.load_vocab_and_merges_from_openai(vocab_path=os.path.join("gpt2_model", "encoder.json"),bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe"))
>>> gpt2.encode("Hello")
[15496]
>>> gpt2.encode("Hello,")
[1544, 18798, 11]
>>> gpt2.encode(",")
[11]

gpt2.encode("Hello,") should actually return [15496, 11], not [1544, 18798, 11].
Tiktoken and huggingface bpe implementation return the correct ids [15496, 11].
Every words contain "," at the end have this issue also.

The comparison notebook https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb already shows this issue.
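For reference, the expected IDs can be reproduced with tiktoken (assuming the tiktoken package is installed):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("Hello,"))  # [15496, 11]
print(enc.encode("Hello"))   # [15496]
print(enc.encode(","))       # [11]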

@rasbt
Owner

rasbt commented Mar 5, 2025

Thanks for the note. I am not exactly sure why this comma issue happens. Might be that I accidentally modified something that broke it. Have to look into it some time!

@cahya-wirawan
Author

Maybe the file vocab.bpe is not the correct one? I don't see any line that merges "He" and "l". There are also many other words that are encoded wrongly; one of them is "Implementations", which is encoded as [29710, 8952, 8326, 20259, 684], whereas tiktoken encodes it as [3546, 26908, 602].

@d-kleine
Contributor

d-kleine commented Mar 6, 2025

I just took a closer look into the BPETokenizerSimple() implementation. I also suspect that the merge rule is the issue here:

" while can_merge and len(token_ids) > 1:\n",
" can_merge = False\n",
" new_tokens = []\n",
" i = 0\n",
" while i < len(token_ids) - 1:\n",
" pair = (token_ids[i], token_ids[i + 1])\n",
" if pair in self.bpe_merges:\n",
" merged_token_id = self.bpe_merges[pair]\n",
" new_tokens.append(merged_token_id)\n",
" # Uncomment for educational purposes:\n",
" # print(f\"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')\")\n",
" i += 2 # Skip the next token as it's merged\n",
" can_merge = True\n",
" else:\n",
" new_tokens.append(token_ids[i])\n",
" i += 1\n",
" if i < len(token_ids):\n",
" new_tokens.append(token_ids[i])\n",
" token_ids = new_tokens\n",

To me, it looks like this loop iterates through token pairs sequentially (left to right; see lines 581-582) and merges the first valid pair it encounters, instead of finding the highest-priority merge (based on the order in the merge list) across the entire token sequence.
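To illustrate the difference, here is a minimal standalone sketch of rank-based merging (the function name, the dict names, and the toy merge table are all made up for illustration, not taken from the notebook): scanning all adjacent pairs each round and applying the lowest-rank one first can produce a different result than merging the first matching pair left to right.

def merge_by_rank(token_ids, ranks, merged):
    while len(token_ids) > 1:
        # Scan ALL adjacent pairs and pick the one with the lowest rank
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(token_ids, token_ids[1:]))
            if pair in ranks
        ]
        if not candidates:
            break  # nothing left to merge
        _, i = min(candidates)
        pair = (token_ids[i], token_ids[i + 1])
        # Replace the pair with its merged token
        token_ids = token_ids[:i] + [merged[pair]] + token_ids[i + 2:]
    return token_ids

# Toy merge table (hypothetical, not GPT-2's): ("b", "c") has rank 0 and
# must be applied first, even though ("a", "b") appears earlier in the
# sequence. A left-to-right first-match loop would produce ["ab", "c"].
ranks = {("b", "c"): 0, ("a", "bc"): 1, ("a", "b"): 2}
merged = {("b", "c"): "bc", ("a", "bc"): "abc", ("a", "b"): "ab"}
print(merge_by_rank(["a", "b", "c"], ranks, merged))  # ['abc']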

@cahya-wirawan
Author

cahya-wirawan commented Mar 6, 2025

Yes, I also think the merge rule involves more than simple left-to-right merging. In this BPETokenizerSimple() implementation, the right side of a pair can only be a single character, although in the merge file (vocab.bpe) the right side is often not just one character but several.

@d-kleine
Contributor

d-kleine commented Mar 7, 2025

What also supports the theory that the merge rule could be the culprit is that the BPETokenizerSimple() implementation is way too fast compared to the HF implementation here. From my POV, this looks too good to be true. If there were a highest-priority merge (which I cannot see in the current implementation, unless I am overlooking it), the pairing would be slower as it's more computationally intensive, and probably similar in speed to the HF implementation.

@cahya-wirawan
Author

The implementation needs to be fixed to have a fair comparison :-)

@rasbt
Owner

rasbt commented Mar 7, 2025

The implementation needs to be fixed to have a fair comparison :-)

Agreed, I will revisit and fix this!

@casinca
Contributor

casinca commented Mar 8, 2025

I tried what @d-kleine said and changed the code block above: instead of processing the pairs as they come, I push them into a min-heap so that merges with the lowest IDs are prioritized (greedily assuming earlier-formed merges are more important/frequent).

        
        # note: requires `import heapq` at the top of the file
        while True:
            # same as before, but pushing candidate merges into a min-heap
            new_tokens = []
            for i in range(len(token_ids) - 1):
                pair = (token_ids[i], token_ids[i + 1])
                if pair in self.bpe_merges:
                    heapq.heappush(new_tokens, (self.bpe_merges[pair], i))

            if not new_tokens:
                break

            # retrieve the index of the selected pair to be merged
            # (i.e., the pair with the lowest merged token ID)
            pair_idx = new_tokens[0][1]  # heap entries have shape (merged_token_id, pair_idx)
            pair_to_merge = (token_ids[pair_idx], token_ids[pair_idx + 1])
            merged_token_id = self.bpe_merges[pair_to_merge]
            # Uncomment for educational purposes:
            # print(f"Merged pair {pair_to_merge} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")

            # replace the pair with the new merged token ID
            # (tokens before the pair + merged token ID + tokens after the pair)
            token_ids = token_ids[:pair_idx] + [merged_token_id] + token_ids[pair_idx + 2:]  # +2 skips the pair

        return token_ids

It seems to work with the edge cases mentioned here, but I haven't tested it extensively, and it's not super elegant...

[screenshot: edge-case encoding results]

Rerunning compare-bpe-tiktoken.ipynb, it's nowhere near as fast as your original, @rasbt, but it looks like it's still about 2x faster than the HF implementation, if the code holds.

@rasbt
Owner

rasbt commented Mar 8, 2025

I believe the version in #561 works now :). What's new? I basically added code to handle the ranks in the OpenAI merges file as @d-kleine suggested.

I think my approach is a bit slower than yours though, @casinca. Mine is on par with the HF one now (but not 2x as fast, as yours shows).

[screenshot: timing comparison]

Btw, I also added a tests.py in #561 to test for edge cases, which you can run via uv run pytest tests.py
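For illustration, such an edge-case test might look roughly like this (a hypothetical sketch, not the actual contents of tests.py; it assumes BPETokenizerSimple is importable and the GPT-2 vocab files are downloaded):

import os
import tiktoken

def test_encode_matches_tiktoken():
    gpt2 = BPETokenizerSimple()
    gpt2.load_vocab_and_merges_from_openai(
        vocab_path=os.path.join("gpt2_model", "encoder.json"),
        bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe"),
    )
    reference = tiktoken.get_encoding("gpt2")
    # strings from this thread that previously triggered the bug
    for text in ["Hello,", "Implementations", "Hello, world. Is this-- a test?"]:
        assert gpt2.encode(text) == reference.encode(text)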

@d-kleine
Contributor

d-kleine commented Mar 8, 2025

@casinca This looks better - I double-checked with compare-bpe-tiktoken.ipynb; unfortunately, the token IDs for "Hello, world. Is this-- a test?" still differ from the official implementations. It seems that your implementation takes the merged token ID as the priority, not the rank. What I mean by highest-priority merging is described here, and explained code-wise here.

So, the idea is to merge the most frequent adjacent token pairs as early as possible. All token pairs are sorted and ranked based on their frequency in the text, where the rank indicates how frequently the token pair appears in the corpus. The lower the rank value, the higher the priority (e.g., a token pair with rank 1 is more frequent than a token pair with rank 2, etc.). The pair with the highest priority (= lowest rank value) is then added to the merge list (merged into a new token and added to the vocabulary), and this process continues iteratively until the predefined vocab size (50,257 in GPT-2, of which 50,000 are learned merges) is reached.
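As a toy illustration of that training loop (a standalone sketch over character tokens, not the repo's code; ties between equally frequent pairs are broken by first occurrence here):

from collections import Counter

def train_bpe_merges(tokens, num_merges):
    merges = []  # ordered list; the index is the rank (0 = highest priority)
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        # apply the new merge rule across the whole token sequence
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(train_bpe_merges(list("hello hello hello"), 3))
# [('h', 'e'), ('he', 'l'), ('hel', 'l')]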

@rasbt
Owner

rasbt commented Mar 8, 2025

This looks better - I double-checked with compare-bpe-tiktoken.ipynb, unfortunately the token IDs for "Hello, world. Is this-- a test?" still differ from the official implementations.

I just added it to the tests, and it seems to work.

@d-kleine
Contributor

d-kleine commented Mar 8, 2025

Apologies, my feedback was on @casinca's implementation.

@rasbt I just double-checked your PR, and this fix looks really good (rank-based merging is now implemented). I also did some testing, no issues 🙂

Edit: The tokenization speed now also looks realistic 👍🏻

@rasbt
Owner

rasbt commented Mar 8, 2025

Got it! We must have coincidentally typed our responses at a similar time, hence the confusion.

Glad it looks good and seems to work now...

@rasbt rasbt closed this as completed in #561 Mar 8, 2025