Commit 4ff7430

BPE cosmetics (#629)
* Llama3 from scratch improvements
* Cosmetic BPE improvements
* restore
* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
* Update ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb
* endoftext whitespace
1 parent adaf4fa commit 4ff7430

File tree

1 file changed: +6 -6 lines changed

ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

Lines changed: 6 additions & 6 deletions
@@ -861,7 +861,7 @@
 "metadata": {},
 "source": [
 "- Next, let's initialize and train the BPE tokenizer with a vocabulary size of 1,000\n",
-"- Note that the vocabulary size is already 255 by default due to the byte values discussed earlier, so we are only \"learning\" 745 vocabulary entries \n",
+"- Note that the vocabulary size is already 256 by default due to the byte values discussed earlier, so we are only \"learning\" 744 vocabulary entries (if we consider the `<|endoftext|>` special token and the `Ġ` whitespace token; so, that's 742 to be precise)\n",
 "- For comparison, the GPT-2 vocabulary is 50,257 tokens, the GPT-4 vocabulary is 100,256 tokens (`cl100k_base` in tiktoken), and GPT-4o uses 199,997 tokens (`o200k_base` in tiktoken); they have all much bigger training sets compared to our simple example text above"
 ]
 },
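
For reference, the vocabulary sizes quoted in the updated cell can be checked against tiktoken directly. The snippet below is a minimal sketch (not part of the commit) and assumes tiktoken is installed; note that `n_vocab` also counts registered special tokens, so the printed values can be slightly larger than the base vocabulary figures in the notebook text.

```python
# Minimal sketch: print the vocabulary sizes of the tiktoken encodings
# mentioned in the notebook cell above (n_vocab includes special tokens).
import tiktoken

for name in ("gpt2", "cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: n_vocab = {enc.n_vocab}")
```
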
@@ -908,7 +908,7 @@
 "id": "36c9da0f-8a18-41cd-91ea-9ccc2bb5febb",
 "metadata": {},
 "source": [
-"- This vocabulary is created by merging 742 times (~ `1000 - len(range(0, 256))`)"
+"- This vocabulary is created by merging 742 times (`= 1000 - len(range(0, 256)) - len(special_tokens) - \"Ġ\" = 1000 - 256 - 1 - 1 = 742`)"
 ]
 },
 {
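
The merge-count arithmetic in the updated cell can be spelled out as a small worked example; this is a sketch of the calculation only, not code from the notebook.

```python
# Worked example of the merge count: the target vocabulary of 1,000 entries
# minus the 256 byte tokens, the <|endoftext|> special token, and the "Ġ"
# whitespace token leaves 742 learned merges.
vocab_size = 1000
num_byte_tokens = len(range(0, 256))  # 256
num_special_tokens = 1                # <|endoftext|>
num_whitespace_tokens = 1             # "Ġ"
print(vocab_size - num_byte_tokens - num_special_tokens - num_whitespace_tokens)  # 742
```
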
@@ -975,12 +975,12 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 256, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]\n"
+"[424, 256, 654, 531, 302, 311, 256, 296, 97, 465, 121, 595, 841, 116, 287, 466, 256, 326, 972, 46, 60, 124, 271, 683, 102, 116, 461, 116, 124, 62]\n"
 ]
 }
 ],
 "source": [
-"input_text = \"Jack embraced beauty through art and life. <|endoftext|> \"\n",
+"input_text = \"Jack embraced beauty through art and life.<|endoftext|> \"\n",
 "token_ids = tokenizer.encode(input_text)\n",
 "print(token_ids)"
 ]
@@ -1000,7 +1000,7 @@
 }
 ],
 "source": [
-"input_text = \"Jack embraced beauty through art and life. <|endoftext|> \"\n",
+"input_text = \"Jack embraced beauty through art and life.<|endoftext|> \"\n",
 "token_ids = tokenizer.encode(input_text, allowed_special={\"<|endoftext|>\"})\n",
 "print(token_ids)"
 ]
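
The `allowed_special` behavior that these two cells rely on can be illustrated with tiktoken's GPT-2 encoding as a point of comparison; this sketch is not the notebook's own tokenizer, and its token IDs will differ from the IDs shown in the output above.

```python
# Comparison sketch with tiktoken (not the notebook's BPE tokenizer):
# by default, encoding text that contains a special token raises an error;
# passing allowed_special opts in explicitly.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
input_text = "Jack embraced beauty through art and life.<|endoftext|> "

try:
    enc.encode(input_text)  # special tokens are disallowed by default
except ValueError as err:
    print("Rejected without allowed_special:", err)

token_ids = enc.encode(input_text, allowed_special={"<|endoftext|>"})
print(token_ids)
```
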
@@ -1015,7 +1015,7 @@
 "name": "stdout",
 "output_type": "stream",
 "text": [
-"Number of characters: 57\n",
+"Number of characters: 56\n",
 "Number of token IDs: 21\n"
 ]
 }
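
The corrected character count follows directly from the new input string (the space before `<|endoftext|>` was dropped). Below is a small sketch for checking it; the token count uses tiktoken's GPT-2 encoding, so it need not equal the notebook's 21 token IDs.

```python
# Minimal check of the character count in the updated output cell.
# The character count is tokenizer-independent; the token count uses
# tiktoken's GPT-2 encoding and may differ from the notebook's tokenizer.
import tiktoken

input_text = "Jack embraced beauty through art and life.<|endoftext|> "
print("Number of characters:", len(input_text))  # 56

enc = tiktoken.get_encoding("gpt2")
token_ids = enc.encode(input_text, allowed_special={"<|endoftext|>"})
print("Number of token IDs:", len(token_ids))
```
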

0 commit comments