Commit 2f41429

Cosmetic improvements to the BPE code (#562)
1 parent f63f04d commit 2f41429

1 file changed: +5 -5 lines changed

ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb

@@ -246,7 +246,7 @@
  "metadata": {},
  "source": [
  "- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
- "- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as follows:"
+ "- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as described in the following sections."
  ]
  },
  {
@@ -286,7 +286,7 @@
  " \n",
  "## 1.4 BPE algorithm example\n",
  "\n",
- "### 1.4.1 Concrete example of the encoding part (steps 1 & 2)\n",
+ "### 1.4.1 Concrete example of the encoding part (steps 1 & 2 in section 1.3)\n",
  "\n",
  "- Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer\n",
  "\n",
@@ -348,7 +348,7 @@
  "- and so forth\n",
  "\n",
  " \n",
- "### 1.4.2 Concrete example of the decoding part (steps 3)\n",
+ "### 1.4.2 Concrete example of the decoding part (step 3 in section 1.3)\n",
  "\n",
  "- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced\n",
  "- Start with the final compressed text: `<258>cat in <258>hat`\n",
@@ -604,10 +604,10 @@
  "                    break\n",
  "\n",
  "            # Find the pair with the best (lowest) rank\n",
- "            min_rank = 1_000_000_000\n",
+ "            min_rank = float(\"inf\")\n",
  "            bigram = None\n",
  "            for p in pairs:\n",
- "                r = self.bpe_ranks.get(p, 1_000_000_000)\n",
+ "                r = self.bpe_ranks.get(p, float(\"inf\"))\n",
  "                if r < min_rank:\n",
  "                    min_rank = r\n",
  "                    bigram = p\n",
