|
246 | 246 | "metadata": {},
|
247 | 247 | "source": [
|
248 | 248 | "- The BPE algorithm was originally described in 1994: \"[A New Algorithm for Data Compression](http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM)\" by Philip Gage\n",
|
249 |
| - "- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as follows:" |
| 249 | + "- Before we get to the actual code implementation, the form that is used for LLM tokenizers today can be summarized as described in the following sections." |
250 | 250 | ]
|
251 | 251 | },
|
252 | 252 | {
|
|
286 | 286 | " \n",
|
287 | 287 | "## 1.4 BPE algorithm example\n",
|
288 | 288 | "\n",
|
289 |
| - "### 1.4.1 Concrete example of the encoding part (steps 1 & 2)\n", |
| 289 | + "### 1.4.1 Concrete example of the encoding part (steps 1 & 2 in section 1.3)\n", |
290 | 290 | "\n",
|
291 | 291 | "- Suppose we have the text (training dataset) `the cat in the hat` from which we want to build the vocabulary for a BPE tokenizer\n",
|
292 | 292 | "\n",
|
|
348 | 348 | "- and so forth\n",
|
349 | 349 | "\n",
|
350 | 350 | " \n",
|
351 |
| - "### 1.4.2 Concrete example of the decoding part (steps 3)\n", |
| 351 | + "### 1.4.2 Concrete example of the decoding part (step 3 in section 1.3)\n", |
352 | 352 | "\n",
|
353 | 353 | "- To restore the original text, we reverse the process by substituting each token ID with its corresponding pair in the reverse order they were introduced\n",
|
354 | 354 | "- Start with the final compressed text: `<258>cat in <258>hat`\n",
|
|
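Review note on the decoding hunk above: the reverse-order substitution it describes can be sketched in a few lines. The merge table here is an assumption made for illustration, since the hunk does not show which pairs IDs 256 to 258 stand for, and `decode` is a hypothetical helper, not a function from the notebook.

```python
# Assumed merge table (hypothetical; not shown in the hunk above):
# token ID -> the pair of symbols that the ID replaced during training.
merges = {
    256: ("t", "h"),
    257: ("<256>", "e"),
    258: ("<257>", " "),
}


def decode(text, merges):
    """Undo the merges from the most recently introduced ID to the oldest."""
    for token_id in sorted(merges, reverse=True):
        left, right = merges[token_id]
        # Replace the placeholder with the pair it was built from.
        text = text.replace(f"<{token_id}>", left + right)
    return text


restored = decode("<258>cat in <258>hat", merges)
```

Processing the IDs in descending order matters: `<258>` expands to text that contains `<257>`, which in turn contains `<256>`, so an oldest-first pass would leave placeholders behind.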
604 | 604 | " break\n",
|
605 | 605 | "\n",
|
606 | 606 | " # Find the pair with the best (lowest) rank\n",
|
607 |
| - " min_rank = 1_000_000_000\n", |
| 607 | + " min_rank = float(\"inf\")\n", |
608 | 608 | " bigram = None\n",
|
609 | 609 | " for p in pairs:\n",
|
610 |
| - " r = self.bpe_ranks.get(p, 1_000_000_000)\n", |
| 610 | + " r = self.bpe_ranks.get(p, float(\"inf\"))\n", |
611 | 611 | " if r < min_rank:\n",
|
612 | 612 | " min_rank = r\n",
|
613 | 613 | " bigram = p\n",
|
|
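Review note on the last hunk: the loop picks the pair with the lowest merge rank, and `float("inf")` serves as a sentinel that no real rank can exceed or equal. The standalone sketch below isolates that logic so it can be checked outside the class; `best_pair` and the sample ranks are hypothetical names and data, not part of the notebook.

```python
def best_pair(pairs, bpe_ranks):
    """Return the pair with the lowest rank, or None if no pair is ranked."""
    min_rank = float("inf")
    bigram = None
    for p in pairs:
        # Pairs missing from the rank table get an infinite rank,
        # so the strict comparison below can never select them.
        r = bpe_ranks.get(p, float("inf"))
        if r < min_rank:
            min_rank = r
            bigram = p
    return bigram


# Hypothetical rank table: lower rank means the merge was learned earlier.
sample_ranks = {("t", "h"): 0, ("th", "e"): 1}

chosen = best_pair([("h", "e"), ("th", "e"), ("t", "h")], sample_ranks)
unranked = best_pair([("x", "y")], sample_ranks)
```

With the old `1_000_000_000` sentinel, a rank table holding a billion or more merges would silently break the comparison; an infinite sentinel has no such upper limit, and integer ranks compare against it without any conversion.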