Implementing the BPE Tokenizer from Scratch #487

Merged: 1 commit merged into main on Jan 17, 2025

Conversation

@rasbt (Owner) commented Jan 17, 2025

Yesterday, a reader shared an insightful discussion (#485) about training an LLM tokenizer for Nepali. It reminded me that I had originally implemented a Byte Pair Encoding (BPE) tokenizer for my LLMs from Scratch book but never uploaded it to the bonus materials.

While my book explains BPE, I opted to use the highly performant tiktoken library (which is used for GPT-4 and now for Llama 3 as well) for practical purposes: the book focuses on LLMs rather than tokenizer development. However, one limitation of tiktoken (please correct me if I'm wrong) is that it doesn't support training a tokenizer on custom datasets, so if you want to support new languages or other special structures, you are kind of out of luck.

Several readers asked me about this over the last few months, so to provide a hands-on resource for tinkerers, I've uploaded my standalone notebook showcasing my BPE implementation with training support. It's for educational purposes and hopefully makes for a fun weekend project if you're interested in experimenting with tokenization.
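
To give a flavor of what "training support" means here, below is a minimal sketch of the core BPE training idea (repeatedly merge the most frequent adjacent pair of symbols). It is not the notebook's actual code, the function name train_bpe is made up for illustration, and it ignores details such as pre-tokenization, special tokens, and building the final vocabulary:

from collections import Counter

def train_bpe(text, num_merges):
    # Start from individual bytes as the initial symbols
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs in the current sequence
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((left, right))
        # Replace every occurrence of the chosen pair with the merged symbol
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges

print(train_bpe("the cat sat on the mat", num_merges=5))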

To my surprise, my tokenizer implementation is only 2x slower than tiktoken (which is implemented in Rust) and 5x faster than the Hugging Face tokenizer.
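
For anyone who wants to reproduce a rough timing comparison, a minimal harness along these lines could be used; which Hugging Face tokenizer class and which benchmark text the notebook uses are assumptions here, so the exact numbers will differ:

import timeit
import tiktoken
from transformers import GPT2Tokenizer

text = "Hello, world. Is this-- a test?" * 1000  # placeholder benchmark text

tik = tiktoken.get_encoding("gpt2")
hf = GPT2Tokenizer.from_pretrained("gpt2")  # assuming the (slow) GPT-2 tokenizer class

# Time 10 encoding passes over the same text for each tokenizer
print("tiktoken:", timeit.timeit(lambda: tik.encode(text), number=10))
print("HF GPT2Tokenizer:", timeit.timeit(lambda: hf(text)["input_ids"], number=10))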

ReviewNB:

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@rasbt rasbt merged commit 0d4967e into main Jan 17, 2025
8 checks passed
@rasbt rasbt deleted the bpe-from-scratch branch January 17, 2025 18:22
@Sunilyadav03

Thanks a lot, sir. You have opened a new window for students who have just jumped into the field of LLM research.

@d-kleine (Contributor)

@rasbt You can also add .decode_single_token_bytes() to display the byte representation of each token (useful for tokens that cannot be printed):
https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken#4-turn-tokens-into-text-with-encodingdecode

import tiktoken

gpt2_tokenizer = tiktoken.get_encoding("gpt2")  # assuming tiktoken's GPT-2 encoding, as in the linked cookbook
for i in range(300):
    decoded = gpt2_tokenizer.decode_single_token_bytes(i)
    print(f"{i}: {decoded}")

@rasbt (Owner, Author) commented Jan 18, 2025

Thanks for the suggestion, but I am not sure it has that much of a practical use case, and it might clutter the code even more 😅

@d-kleine (Contributor)

I think it would be helpful for understanding, since the BPE tokenizer stores tokens as bytes for compression (that's why the first 256 elements are reserved: they represent all possible byte values). I also think the trained vocab.json should be converted to bytes, or at least a note about this should be added to the text.
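
As a quick illustration of that point, a check like the following (using tiktoken's GPT-2 encoding, as in the snippet above) shows that the first 256 token IDs each decode to exactly one raw byte and together cover all 256 possible byte values:

import tiktoken

enc = tiktoken.get_encoding("gpt2")
first_256 = [enc.decode_single_token_bytes(i) for i in range(256)]
print(all(len(b) == 1 for b in first_256))  # True: each of the first 256 IDs is a single byte
print(len(set(first_256)))                  # 256: together they cover every possible byte value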

@rasbt (Owner, Author) commented Jan 19, 2025

Yes, I should iterate on that more... but yeah, this itself could be a whole book, haha.

@rasbt changed the title from "Implementingthe BPE Tokenizer from Scratch" to "Implementing the BPE Tokenizer from Scratch" on Jan 19, 2025
jiyangzh added a commit to jiyangzh/LLMs-from-scratch that referenced this pull request Feb 1, 2025
* Add "What's next" section (rasbt#432)

* Add What's next section

* Delete appendix-D/01_main-chapter-code/appendix-D-Copy2.ipynb

* Delete ch03/01_main-chapter-code/ch03-Copy1.ipynb

* Delete appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb

* Update ch07.ipynb

* Update ch07.ipynb

* Add chapter names

* Add missing device transfer in gpt_generate.py (rasbt#436)

* Add utility to prevent double execution of certain cells (rasbt#437)

* Add flexible padding bonus experiment (rasbt#438)

* Add flexible padding bonus experiment

* fix links

* Fixed command for row 16 additional experiment (rasbt#439)

* fixed command for row 16 experiment

* Update README.md

---------

Co-authored-by: Sebastian Raschka <[email protected]>

* [minor] typo & comments (rasbt#441)

* typo & comment

- safe -> save
- commenting code: batch_size, seq_len = in_idx.shape

* comment

- adding # NEW for assert num_heads % num_kv_groups == 0

* update memory wording

---------

Co-authored-by: rasbt <[email protected]>

* fix misplaced parenthesis and update license (rasbt#466)

* Minor readability improvement in dataloader.ipynb (rasbt#461)

* Minor readability improvement in dataloader.ipynb

- The tokenizer and encoded_text variables at the root level are unused.
- The default params for create_dataloader_v1 are confusing, especially for the default batch_size 4, which happens to be the same as the max_length.

* readability improvements

---------

Co-authored-by: rasbt <[email protected]>

* typo fixed (rasbt#468)

* typo fixed

* only update plot

---------

Co-authored-by: rasbt <[email protected]>

* Add backup URL for gpt2 weights (rasbt#469)

* Add backup URL for gpt2 weights

* newline

* fix ch07 unit test (rasbt#470)

* adds no-grad context for reference model to DPO (rasbt#473)

* Auto download DPO dataset if not already available in path (rasbt#479)

* Auto download DPO dataset if not already available in path

* update tests to account for latest HF transformers release in unit tests

* pep 8

* fix reward margins plot label in dpo nb

* Print out embeddings for more illustrative learning (rasbt#481)

* print out embeddings for illustrative learning

* suggestion print embeddingcontents

---------

Co-authored-by: rasbt <[email protected]>

* Include mathematical breakdown for exercise solution 4.1 (rasbt#483)

* 04_optional-aws-sagemaker-notebook (rasbt#451)

* 04_optional-aws-sagemaker-notebook

* Update setup/04_optional-aws-sagemaker-notebook/cloudformation-template.yml

* Update README.md

---------

Co-authored-by: Sebastian Raschka <[email protected]>

* Implementingthe  BPE Tokenizer from Scratch (rasbt#487)

* BPE: fixed typo (rasbt#492)

* fixed typo

* use rel path if exists

* mod gitignore and use existing vocab files

---------

Co-authored-by: rasbt <[email protected]>

* fix: preserve newline tokens in BPE encoder (rasbt#495)

* fix: preserve newline tokens in BPE encoder

* further fixes

* more fixes

---------

Co-authored-by: rasbt <[email protected]>

* add GPT2TokenizerFast to BPE comparison (rasbt#498)

* added HF BPE Fast

* update benchmarks

* add note about performance

* revert accidental changes

---------

Co-authored-by: rasbt <[email protected]>

* Bonus material: extending tokenizers (rasbt#496)

* Bonus material: extending tokenizers

* small wording update

* Test for PyTorch 2.6 release candidate (rasbt#500)

* Test for PyTorch 2.6 release candidate

* update

* update

* remove extra added file

* A few cosmetic updates (rasbt#504)

* Fix default argument in ex 7.2 (rasbt#506)

* Alternative weight loading via .safetensors (rasbt#507)

* Test PyTorch nightly releases (rasbt#509)

---------

Co-authored-by: Sebastian Raschka <[email protected]>
Co-authored-by: Daniel Kleine <[email protected]>
Co-authored-by: casinca <[email protected]>
Co-authored-by: Tao Qian <[email protected]>
Co-authored-by: QS <[email protected]>
Co-authored-by: Henry Shi <[email protected]>
Co-authored-by: rvaneijk <[email protected]>
Co-authored-by: Austin Welch <[email protected]>