Encoding issue with "Hello," #558
Thanks for the note. I am not exactly sure why this comma issue happens. It might be that I accidentally modified something that broke it. I'll have to look into it some time!
Maybe the file vocab.bpe is not the correct one? Because I don't see any line that merges "He" and "l". There are also many other words that are encoded wrongly; one of them is "Implementations", which is encoded as [29710, 8952, 8326, 20259, 684], whereas tiktoken encodes it as [3546, 26908, 602].
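For reference, here is a minimal cross-check against tiktoken's GPT-2 encoding; the expected IDs in the comments are the ones quoted in this thread:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")

print(enc.encode("Implementations"))  # expected: [3546, 26908, 602] (as quoted above)
print(enc.encode("Hello,"))           # expected: [15496, 11]

# Decode the IDs produced by BPETokenizerSimple to see which subwords it picked:
print(enc.decode([29710, 8952, 8326, 20259, 684]))
print(enc.decode([1544, 18798, 11]))
```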
I just took a closer look at LLMs-from-scratch/ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb, lines 575 to 593 (at commit 86b714a).
For me, it looks like this loop iterates through token pairs sequentially (left to right; see lines 581-582) and merges the first valid pair it encounters, instead of finding the highest-priority merge (based on the order in the merge list) across the entire token sequence.
Yes, I also think the merge rule is more than just simple left-to-right merging. In this BPETokenizerSimple() implementation, the right side of the pair can only be a single character, although in the merges file (vocab.bpe) the right side is not just a single character but very often several characters.
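To make this concrete, here is an illustrative sketch (not the notebook's actual code) of a GPT-2-style merge loop: at each step, scan all adjacent pairs and apply the merge with the lowest rank in the merges table, rather than merging left to right. The names `merge_ranks` and `merge_result` are placeholders.

```python
def encode_with_ranks(token_ids, merge_ranks, merge_result):
    """Illustrative GPT-2-style merge loop (sketch, not the notebook's code).

    token_ids:    list of current token IDs
    merge_ranks:  dict mapping (id_a, id_b) -> rank (lower = higher priority)
    merge_result: dict mapping (id_a, id_b) -> merged token ID
    """
    while True:
        # Find the adjacent pair with the lowest rank across the whole sequence.
        best_pair, best_rank = None, None
        for a, b in zip(token_ids, token_ids[1:]):
            rank = merge_ranks.get((a, b))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_pair, best_rank = (a, b), rank
        if best_pair is None:
            break  # no more applicable merges

        # Merge every occurrence of the best pair in one pass.
        merged = merge_result[best_pair]
        new_ids, i = [], 0
        while i < len(token_ids):
            if i < len(token_ids) - 1 and (token_ids[i], token_ids[i + 1]) == best_pair:
                new_ids.append(merged)
                i += 2
            else:
                new_ids.append(token_ids[i])
                i += 1
        token_ids = new_ids
    return token_ids
```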
What also supports the theory that the merge rule could be the culprit is that the ...
The implementation needs to be fixed to have a fair comparison :-)
Agreed, I will revisit and fix this!
I tried what @d-kleine said and changed the code block above: instead of processing the pairs as they come, I push them into a min-heap so that merges with the lowest IDs are prioritized (greedily assuming earlier merges correspond to lower token IDs):
```python
# Inside the BPETokenizerSimple encoding method (requires `import heapq` at module level):
while True:
    # Same as before, but push candidate merges into a min-heap
    new_tokens = []
    for i in range(len(token_ids) - 1):
        pair = (token_ids[i], token_ids[i + 1])
        if pair in self.bpe_merges:
            heapq.heappush(new_tokens, (self.bpe_merges[pair], i))
    if not new_tokens:
        break

    # Retrieve the index of the pair to merge, i.e., the pair with the lowest
    # merged token ID (heap entries have the shape (merged_token_id, pair_idx))
    pair_idx = new_tokens[0][1]
    pair_to_merge = (token_ids[pair_idx], token_ids[pair_idx + 1])
    merged_token_id = self.bpe_merges[pair_to_merge]

    # Uncomment for educational purposes:
    # print(f"Merged pair {pair_to_merge} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")

    # Replace the pair with the new merged token ID
    # (tokens before the pair + merged token ID + tokens after the pair)
    token_ids = token_ids[:pair_idx] + [merged_token_id] + token_ids[pair_idx + 2:]  # +2 skips the pair

return token_ids
```

It seems to work with the edge cases mentioned here, but I haven't tested it extensively and it's not super elegant...
I believe the version in #561 works now :). What's new? I basically added code to handle the ranks in the OpenAI merges file, as @d-kleine suggested. I think my approach is a bit slower than yours though, @casinca. Mine is on par with the HF one now (but not 2x as fast, as yours shows). Btw, I also added a ...
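For anyone unfamiliar with the file, here is a sketch of how the merge ranks in OpenAI's vocab.bpe can be read (not necessarily how #561 implements it): the first line is a version header, each subsequent line holds one space-separated merge pair, and the line's position defines its rank.

```python
from pathlib import Path

def load_merge_ranks(path="vocab.bpe"):
    # Skip the "#version: ..." header; rank = position in the file (lower = higher priority)
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    merges = [tuple(line.split()) for line in lines[1:] if line.strip()]
    return {pair: rank for rank, pair in enumerate(merges)}

# e.g., ranks = load_merge_ranks(); ranks.get(("Hel", "lo")) -> rank or None
```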
@casinca This looks better - I double-checked with ... So, the idea is to merge the most frequent adjacent token pairs as early as possible. All token pairs are sorted and ranked based on their frequency in the text, where the rank indicates how frequently the token pair appears in the corpus. The lower the rank value, the higher the rank (e.g., a token pair with rank 1 is more frequent than a token pair with rank 2, etc.). The pair with the highest rank (= lowest rank value) is then added to the merge list (merged into a new token and added to the vocabulary), and this process continues iteratively until the predefined vocab size (50,257 in GPT-2, of which 50,000 are learned merges) is reached.
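As a toy illustration of that training procedure (a simplified sketch, not GPT-2's actual training code): count adjacent pair frequencies, merge the most frequent pair, and repeat until the desired number of merges is reached.

```python
from collections import Counter

def train_bpe(tokens, num_merges):
    merges = []                      # merge list; position in the list = rank
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, _ = pair_counts.most_common(1)[0]   # most frequent adjacent pair
        merges.append(best_pair)
        new_token = best_pair[0] + best_pair[1]
        # Replace every occurrence of the best pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, toks = train_bpe(list("hello hello help"), num_merges=5)
print(merges)   # earlier merges = lower rank = more frequent pairs
```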
I just added it to the test and it seems to work.
Got it! We must have coincidentally typed our responses at a similar time, hence the confusion. Glad it looks good and seems to work now...
Hi,
I use your `BPETokenizerSimple` class from the bonus content for encoding. There seems to be an issue encoding the simple text "Hello,": `gpt2.encode("Hello,")` should actually return [15496, 11], not [1544, 18798, 11]. Tiktoken and the Hugging Face BPE implementation return the correct IDs [15496, 11].
Every word ending with "," has this issue as well.
The comparison notebook https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb already shows this issue.