Hi,
I am using your BPETokenizerSimple class from the bonus content for encoding. There seems to be an issue when encoding a simple text like "Hello,":
>>> import os
>>> gpt2 = BPETokenizerSimple()
>>> gpt2.load_vocab_and_merges_from_openai(vocab_path=os.path.join("gpt2_model", "encoder.json"),bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe"))
>>> gpt2.encode("Hello")
[15496]
>>> gpt2.encode("Hello,")
[1544, 18798, 11]
>>> gpt2.encode(",")
[11]
gpt2.encode("Hello,") should actually return [15496, 11], not [1544, 18798, 11].
The tiktoken and Hugging Face BPE implementations both return the correct IDs, [15496, 11].
Every word ending with "," has this issue as well.
The comparison notebook https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb already shows this issue.
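For reference, a minimal cross-check with tiktoken (assuming it is installed, e.g. via pip install tiktoken) reproduces the expected IDs:
>>> import tiktoken
>>> tik = tiktoken.get_encoding("gpt2")
>>> tik.encode("Hello,")
[15496, 11]
>>> tik.encode("Hello")
[15496]
>>> tik.encode(",")
[11]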