Encoding issue with "Hello," #558


Closed
cahya-wirawan opened this issue Mar 5, 2025 · 13 comments · Fixed by #561

Comments

@cahya-wirawan

cahya-wirawan commented Mar 5, 2025

Hi,
I use your BPETokenizerSimple class from the bonus contents for encoding. There seems to be an issue encoding the simple text "Hello,":

>>> import os
>>> gpt2 = BPETokenizerSimple()
>>> gpt2.load_vocab_and_merges_from_openai(vocab_path=os.path.join("gpt2_model", "encoder.json"),bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe"))
>>> gpt2.encode("Hello")
[15496]
>>> gpt2.encode("Hello,")
[1544, 18798, 11]
>>> gpt2.encode(",")
[11]

gpt2.encode("Hello,") should actually return [15496, 11], not [1544, 18798, 11].
Tiktoken and huggingface bpe implementation return the correct ids [15496, 11].
Every words contain "," at the end have this issue also.

The comparison notebook https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb already shows this issue.
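For reference, the expected IDs can be reproduced with tiktoken (assuming the tiktoken package is installed):

import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("Hello,"))  # [15496, 11]
print(enc.encode("Hello"))   # [15496]
print(enc.encode(","))       # [11]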

@rasbt
Owner

rasbt commented Mar 5, 2025

Thanks for the note. I am not exactly sure why this comma issue happens. Might be that I accidentally modified something that broke it. Have to look into it some time!

@cahya-wirawan
Author

Maybe the file vocab.bpe is not the correct one? I don't see any line that merges "He" and "l". There are also many other words that are encoded wrongly; one of them is "Implementations", which is encoded as [29710, 8952, 8326, 20259, 684], whereas tiktoken encodes it as [3546, 26908, 602].

@d-kleine
Contributor

d-kleine commented Mar 6, 2025

I just took a closer look into the BPETokenizerSimple() implementation. I also suspect that the merge rule is the issue here:

" while can_merge and len(token_ids) > 1:\n",
" can_merge = False\n",
" new_tokens = []\n",
" i = 0\n",
" while i < len(token_ids) - 1:\n",
" pair = (token_ids[i], token_ids[i + 1])\n",
" if pair in self.bpe_merges:\n",
" merged_token_id = self.bpe_merges[pair]\n",
" new_tokens.append(merged_token_id)\n",
" # Uncomment for educational purposes:\n",
" # print(f\"Merged pair {pair} -> {merged_token_id} ('{self.vocab[merged_token_id]}')\")\n",
" i += 2 # Skip the next token as it's merged\n",
" can_merge = True\n",
" else:\n",
" new_tokens.append(token_ids[i])\n",
" i += 1\n",
" if i < len(token_ids):\n",
" new_tokens.append(token_ids[i])\n",
" token_ids = new_tokens\n",

To me, it looks like this loop iterates through token pairs sequentially (left to right; see lines 581-582) and merges the first valid pair it encounters, instead of finding the highest-priority merge (based on the order in the merge list) across the entire token sequence.
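To illustrate the difference, here is a minimal standalone sketch of rank-based merging (the function name, the dict names, and the toy merge table are all made up for illustration, not taken from the notebook): scanning all adjacent pairs each round and applying the lowest-rank one first can produce a different result than merging the first matching pair left to right.

def merge_by_rank(token_ids, ranks, merged):
    while len(token_ids) > 1:
        # Scan ALL adjacent pairs and pick the one with the lowest rank
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(token_ids, token_ids[1:]))
            if pair in ranks
        ]
        if not candidates:
            break  # nothing left to merge
        _, i = min(candidates)
        pair = (token_ids[i], token_ids[i + 1])
        # Replace the pair with its merged token
        token_ids = token_ids[:i] + [merged[pair]] + token_ids[i + 2:]
    return token_ids

# Toy merge table (hypothetical, not GPT-2's): ("b", "c") has rank 0 and
# must be applied first, even though ("a", "b") appears earlier in the
# sequence. A left-to-right first-match loop would produce ["ab", "c"].
ranks = {("b", "c"): 0, ("a", "bc"): 1, ("a", "b"): 2}
merged = {("b", "c"): "bc", ("a", "bc"): "abc", ("a", "b"): "ab"}
print(merge_by_rank(["a", "b", "c"], ranks, merged))  # ['abc']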

@cahya-wirawan
Author

cahya-wirawan commented Mar 6, 2025

Yes, I also think the merge rule involves more than simple left-to-right merging. In this BPETokenizerSimple() implementation, the right side of a pair can only be a single character, although in the merge file (vocab.bpe) the right side is often not just one character but several.

@d-kleine
Contributor

d-kleine commented Mar 7, 2025

What also supports the theory that the merge rule could be the culprit is that the BPETokenizerSimple() implementation is way too fast compared to the HF implementation here. From my POV, this looks too good to be true. If there were a highest-priority merge (which I cannot see in the current implementation, unless I am overlooking it), the pairing would be slower as it's more computationally intensive, and probably similar in speed to the HF implementation.

@cahya-wirawan
Author

The implementation needs to be fixed to have a fair comparison :-)

@rasbt
Owner

rasbt commented Mar 7, 2025

The implementation needs to be fixed to have a fair comparison :-)

Agreed, I will revisit and fix this!

@casinca
Contributor

casinca commented Mar 8, 2025

I tried what @d-kleine said and changed the code block above: instead of processing the pairs as they come, I push them into a min-heap so that merges with the lowest IDs are prioritized (greedily assuming earlier-formed merges are more important/frequent).

        
        # note: requires `import heapq` at the top of the file
        while True:
            # same as before, but pushing candidate merges into a min-heap
            new_tokens = []
            for i in range(len(token_ids) - 1):
                pair = (token_ids[i], token_ids[i + 1])
                if pair in self.bpe_merges:
                    heapq.heappush(new_tokens, (self.bpe_merges[pair], i))

            if not new_tokens:
                break

            # retrieve the index of the selected pair to be merged
            # (i.e., the pair with the lowest merged token ID)
            pair_idx = new_tokens[0][1]  # heap entries have shape (merged_token_id, pair_idx)
            pair_to_merge = (token_ids[pair_idx], token_ids[pair_idx + 1])
            merged_token_id = self.bpe_merges[pair_to_merge]
            # Uncomment for educational purposes:
            # print(f"Merged pair {pair_to_merge} -> {merged_token_id} ('{self.vocab[merged_token_id]}')")

            # replace the pair with the new merged token ID
            # (tokens before the pair + merged token ID + tokens after the pair)
            token_ids = token_ids[:pair_idx] + [merged_token_id] + token_ids[pair_idx + 2:]  # +2 skips the pair

        return token_ids

It seems to work with the edge cases mentioned here, but I haven't tested it extensively, and it's not super elegant...

[screenshot: edge-case encoding results]

Rerunning compare-bpe-tiktoken.ipynb, it's nowhere near as fast as your original, @rasbt, but it looks like it's still about 2x faster than the HF implementation, if the code holds.

@rasbt
Owner

rasbt commented Mar 8, 2025

I believe the version in #561 works now :). What's new? I basically added code to handle the ranks in the OpenAI merges file as @d-kleine suggested.

I think my approach is a bit slower than yours though, @casinca. Mine is on par with the HF one now (but not 2x as fast, as yours shows).

[screenshot: timing comparison]

Btw, I also added a tests.py in #561 to test for edge cases, which you can run via uv run pytest tests.py
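For illustration, such an edge-case test might look roughly like this (a hypothetical sketch, not the actual contents of tests.py; it assumes BPETokenizerSimple is importable and the GPT-2 vocab files are downloaded):

import os
import tiktoken

def test_encode_matches_tiktoken():
    gpt2 = BPETokenizerSimple()
    gpt2.load_vocab_and_merges_from_openai(
        vocab_path=os.path.join("gpt2_model", "encoder.json"),
        bpe_merges_path=os.path.join("gpt2_model", "vocab.bpe"),
    )
    reference = tiktoken.get_encoding("gpt2")
    # strings from this thread that previously triggered the bug
    for text in ["Hello,", "Implementations", "Hello, world. Is this-- a test?"]:
        assert gpt2.encode(text) == reference.encode(text)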

@d-kleine
Contributor

d-kleine commented Mar 8, 2025

@casinca This looks better - I double-checked with compare-bpe-tiktoken.ipynb; unfortunately, the token IDs for "Hello, world. Is this-- a test?" still differ from the official implementations. It seems that your implementation takes the merged token ID as the priority, not the rank. What I mean by highest-priority merging is described here, and explained code-wise here.

So, the idea is to merge the most frequent adjacent token pairs as early as possible. All token pairs are sorted and ranked based on their frequency in the text, where the rank indicates how frequently the token pair appears in the corpus. The lower the rank value, the higher the priority (e.g., a token pair with rank 1 is more frequent than a token pair with rank 2, etc.). The pair with the highest priority (= lowest rank value) is then added to the merge list (merged into a new token and added to the vocabulary), and this process continues iteratively until the predefined vocab size (50,257 in GPT-2, of which 50,000 are learned merges) is reached.
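As a toy illustration of that training loop (a standalone sketch over character tokens, not the repo's code; ties between equally frequent pairs are broken by first occurrence here):

from collections import Counter

def train_bpe_merges(tokens, num_merges):
    merges = []  # ordered list; the index is the rank (0 = highest priority)
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        # apply the new merge rule across the whole token sequence
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(train_bpe_merges(list("hello hello hello"), 3))
# [('h', 'e'), ('he', 'l'), ('hel', 'l')]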

@rasbt
Owner

rasbt commented Mar 8, 2025

This looks better - I double-checked with compare-bpe-tiktoken.ipynb, unfortunately the token IDs for "Hello, world. Is this-- a test?" still differ from the official implementations.

I just added it to the tests, and it seems to work.

@d-kleine
Contributor

d-kleine commented Mar 8, 2025

Apologies, my feedback was on @casinca's implementation.

@rasbt I just double-checked your PR, and this fix looks really good (rank-based merging is now implemented). I also did some testing, no issues 🙂

Edit: The tokenization speed now also looks realistic 👍🏻

@rasbt
Owner

rasbt commented Mar 8, 2025

Got it! We must have coincidentally typed our responses at a similar time, hence the confusion.

Glad it looks good and seems to work now...

@rasbt rasbt closed this as completed in #561 Mar 8, 2025