
Commit 22b3fab

Authored by jiyangzh, rasbt, d-kleine, casinca, and tao-qian
Pull latest from Upstream (#1)
* Add "What's next" section (rasbt#432)
* Add What's next section
* Delete appendix-D/01_main-chapter-code/appendix-D-Copy2.ipynb
* Delete ch03/01_main-chapter-code/ch03-Copy1.ipynb
* Delete appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb
* Update ch07.ipynb
* Update ch07.ipynb
* Add chapter names
* Add missing device transfer in gpt_generate.py (rasbt#436)
* Add utility to prevent double execution of certain cells (rasbt#437)
* Add flexible padding bonus experiment (rasbt#438)
* Add flexible padding bonus experiment
* fix links
* Fixed command for row 16 additional experiment (rasbt#439)
* fixed command for row 16 experiment
* Update README.md
---------
Co-authored-by: Sebastian Raschka <[email protected]>
* [minor] typo & comments (rasbt#441)
* typo & comment
  - safe -> save
  - commenting code: batch_size, seq_len = in_idx.shape
* comment
  - adding # NEW for assert num_heads % num_kv_groups == 0
* update memory wording
---------
Co-authored-by: rasbt <[email protected]>
* fix misplaced parenthesis and update license (rasbt#466)
* Minor readability improvement in dataloader.ipynb (rasbt#461)
* Minor readability improvement in dataloader.ipynb
  - The tokenizer and encoded_text variables at the root level are unused.
  - The default params for create_dataloader_v1 are confusing, especially the default batch_size 4, which happens to be the same as the max_length.
* readability improvements
---------
Co-authored-by: rasbt <[email protected]>
* typo fixed (rasbt#468)
* typo fixed
* only update plot
---------
Co-authored-by: rasbt <[email protected]>
* Add backup URL for gpt2 weights (rasbt#469)
* Add backup URL for gpt2 weights
* newline
* fix ch07 unit test (rasbt#470)
* adds no-grad context for reference model to DPO (rasbt#473)
* Auto download DPO dataset if not already available in path (rasbt#479)
* Auto download DPO dataset if not already available in path
* update tests to account for latest HF transformers release in unit tests
* pep 8
* fix reward margins plot label in dpo nb
* Print out embeddings for more illustrative learning (rasbt#481)
* print out embeddings for illustrative learning
* suggestion: print embedding contents
---------
Co-authored-by: rasbt <[email protected]>
* Include mathematical breakdown for exercise solution 4.1 (rasbt#483)
* 04_optional-aws-sagemaker-notebook (rasbt#451)
* 04_optional-aws-sagemaker-notebook
* Update setup/04_optional-aws-sagemaker-notebook/cloudformation-template.yml
* Update README.md
---------
Co-authored-by: Sebastian Raschka <[email protected]>
* Implementing the BPE Tokenizer from Scratch (rasbt#487)
* BPE: fixed typo (rasbt#492)
* fixed typo
* use rel path if exists
* mod gitignore and use existing vocab files
---------
Co-authored-by: rasbt <[email protected]>
* fix: preserve newline tokens in BPE encoder (rasbt#495)
* fix: preserve newline tokens in BPE encoder
* further fixes
* more fixes
---------
Co-authored-by: rasbt <[email protected]>
* add GPT2TokenizerFast to BPE comparison (rasbt#498)
* added HF BPE Fast
* update benchmarks
* add note about performance
* revert accidental changes
---------
Co-authored-by: rasbt <[email protected]>
* Bonus material: extending tokenizers (rasbt#496)
* Bonus material: extending tokenizers
* small wording update
* Test for PyTorch 2.6 release candidate (rasbt#500)
* Test for PyTorch 2.6 release candidate
* update
* update
* remove extra added file
* A few cosmetic updates (rasbt#504)
* Fix default argument in ex 7.2 (rasbt#506)
* Alternative weight loading via .safetensors (rasbt#507)
* Test PyTorch nightly releases (rasbt#509)
---------
Co-authored-by: Sebastian Raschka <[email protected]>
Co-authored-by: Daniel Kleine <[email protected]>
Co-authored-by: casinca <[email protected]>
Co-authored-by: Tao Qian <[email protected]>
Co-authored-by: QS <[email protected]>
Co-authored-by: Henry Shi <[email protected]>
Co-authored-by: rvaneijk <[email protected]>
Co-authored-by: Austin Welch <[email protected]>
1 parent 1183fd7 commit 22b3fab

45 files changed: +4115 / -451 lines. Large diffs and some files are hidden by default and are not rendered below.
New GitHub Actions workflow: "Test latest PyTorch nightly / release candidate" (new file, +52)

@@ -0,0 +1,52 @@
+name: Test latest PyTorch nightly / release candidate
+on:
+  push:
+    branches: [ main ]
+    paths:
+      - '**/*.py' # Run workflow for changes in Python files
+      - '**/*.ipynb'
+      - '**/*.yaml'
+      - '**/*.yml'
+      - '**/*.sh'
+  pull_request:
+    branches: [ main ]
+    paths:
+      - '**/*.py'
+      - '**/*.ipynb'
+      - '**/*.yaml'
+      - '**/*.yml'
+      - '**/*.sh'
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pytest nbval
+          if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+          pip install -r ch05/07_gpt_to_llama/tests/test-requirements-extra.txt
+          pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
+
+      - name: Test Selected Python Scripts
+        run: |
+          pytest setup/02_installing-python-libraries/tests.py
+          pytest ch04/01_main-chapter-code/tests.py
+          pytest ch05/01_main-chapter-code/tests.py
+          pytest ch05/07_gpt_to_llama/tests/tests.py
+          pytest ch06/01_main-chapter-code/tests.py
+
+      - name: Validate Selected Jupyter Notebooks
+        run: |
+          pytest --nbval ch02/01_main-chapter-code/dataloader.ipynb
+          pytest --nbval ch03/01_main-chapter-code/multihead-attention.ipynb
+          pytest --nbval ch02/04_bonus_dataloader-intuition/dataloader-intuition.ipynb

.github/workflows/check-links.yml (+1, -1)

@@ -29,6 +29,6 @@ jobs:
 
       - name: Check links
        run: |
-          pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
+          pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://openai.com/*" --check-links-ignore "https://arena.lmsys.org" --check-links-ignore https://unsloth.ai/blog/gradient --check-links-ignore "https://www.reddit.com/r/*" --check-links-ignore "https://code.visualstudio.com/*" --check-links-ignore https://arxiv.org/* --check-links-ignore "https://ai.stanford.edu/~amaas/data/sentiment/"
           # pytest --check-links ./ --check-links-ignore "https://platform.openai.com/*" --check-links-ignore "https://arena.lmsys.org" --retries 2 --retry-delay 5
 

.gitignore (+8)

@@ -31,6 +31,7 @@ appendix-E/01_main-chapter-code/gpt2
 
 ch05/01_main-chapter-code/gpt2/
 ch05/02_alternative_weight_loading/checkpoints
+ch05/02_alternative_weight_loading/*.safetensors
 ch05/01_main-chapter-code/model.pth
 ch05/01_main-chapter-code/model_and_optimizer.pth
 ch05/03_bonus_pretraining_on_gutenberg/model_checkpoints

@@ -101,6 +102,13 @@ ch07/02_dataset-utilities/instruction-examples-modified.json
 ch07/04_preference-tuning-with-dpo/gpt2-medium355M-sft.pth
 ch07/04_preference-tuning-with-dpo/loss-plot.pdf
 
+# Tokenizer files
+ch02/05_bpe-from-scratch/bpe_merges.txt
+ch02/05_bpe-from-scratch/encoder.json
+ch02/05_bpe-from-scratch/vocab.bpe
+ch02/05_bpe-from-scratch/vocab.json
+
+
 # Other
 ch0?/0?_user_interface/.chainlit/
 ch0?/0?_user_interface/chainlit.md

LICENSE.txt (+1, -1)

@@ -189,7 +189,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
 
-   Copyright 2023-2024 Sebastian Raschka
+   Copyright 2023-2025 Sebastian Raschka
 
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

README.md (+8, -6)

@@ -101,16 +101,17 @@ Several folders contain optional materials as a bonus for interested readers:
   - [Python Setup Tips](setup/01_optional-python-setup-preferences)
   - [Installing Python Packages and Libraries Used In This Book](setup/02_installing-python-libraries)
   - [Docker Environment Setup Guide](setup/03_optional-docker-environment)
-- **Chapter 2:**
+- **Chapter 2: Working with text data**
+  - [Byte Pair Encoding (BPE) Tokenizer From Scratch](ch02/05_bpe-from-scratch/bpe-from-scratch.ipynb)
   - [Comparing Various Byte Pair Encoding (BPE) Implementations](ch02/02_bonus_bytepair-encoder)
   - [Understanding the Difference Between Embedding Layers and Linear Layers](ch02/03_bonus_embedding-vs-matmul)
   - [Dataloader Intuition with Simple Numbers](ch02/04_bonus_dataloader-intuition)
-- **Chapter 3:**
+- **Chapter 3: Coding attention mechanisms**
  - [Comparing Efficient Multi-Head Attention Implementations](ch03/02_bonus_efficient-multihead-attention/mha-implementations.ipynb)
  - [Understanding PyTorch Buffers](ch03/03_understanding-buffers/understanding-buffers.ipynb)
-- **Chapter 4:**
+- **Chapter 4: Implementing a GPT model from scratch**
  - [FLOPS Analysis](ch04/02_performance-analysis/flops-analysis.ipynb)
-- **Chapter 5:**
+- **Chapter 5: Pretraining on unlabeled data:**
  - [Alternative Weight Loading from Hugging Face Model Hub using Transformers](ch05/02_alternative_weight_loading/weight-loading-hf-transformers.ipynb)
  - [Pretraining GPT on the Project Gutenberg Dataset](ch05/03_bonus_pretraining_on_gutenberg)
  - [Adding Bells and Whistles to the Training Loop](ch05/04_learning_rate_schedulers)

@@ -119,11 +120,12 @@ Several folders contain optional materials as a bonus for interested readers:
  - [Converting GPT to Llama](ch05/07_gpt_to_llama)
  - [Llama 3.2 From Scratch](ch05/07_gpt_to_llama/standalone-llama32.ipynb)
  - [Memory-efficient Model Weight Loading](ch05/08_memory_efficient_weight_loading/memory-efficient-state-dict.ipynb)
+  - [Extending the Tiktoken BPE Tokenizer with New Tokens](ch05/09_extending-tokenizers/extend-tiktoken.ipynb)
-- **Chapter 6:**
+- **Chapter 6: Finetuning for classification**
  - [Additional experiments finetuning different layers and using larger models](ch06/02_bonus_additional-experiments)
  - [Finetuning different models on 50k IMDB movie review dataset](ch06/03_bonus_imdb-classification)
  - [Building a User Interface to Interact With the GPT-based Spam Classifier](ch06/04_user_interface)
-- **Chapter 7:**
+- **Chapter 7: Finetuning to follow instructions**
  - [Dataset Utilities for Finding Near Duplicates and Creating Passive Voice Entries](ch07/02_dataset-utilities)
  - [Evaluating Instruction Responses Using the OpenAI API and Ollama](ch07/03_model-evaluation)
  - [Generating a Dataset for Instruction Finetuning](ch07/05_dataset-generation/llama3-ollama.ipynb)

appendix-D/01_main-chapter-code/appendix-D.ipynb (+31, -32): large diff not rendered by default.

appendix-E/01_main-chapter-code/gpt_download.py (+33, -18)

@@ -23,6 +23,7 @@ def download_and_load_gpt2(model_size, models_dir):
     # Define paths
     model_dir = os.path.join(models_dir, model_size)
     base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
+    backup_base_url = "https://f001.backblazeb2.com/file/LLMs-from-scratch/gpt2"
     filenames = [
         "checkpoint", "encoder.json", "hparams.json",
         "model.ckpt.data-00000-of-00001", "model.ckpt.index",

@@ -33,8 +34,9 @@ def download_and_load_gpt2(model_size, models_dir):
     os.makedirs(model_dir, exist_ok=True)
     for filename in filenames:
         file_url = os.path.join(base_url, model_size, filename)
+        backup_url = os.path.join(backup_base_url, model_size, filename)
         file_path = os.path.join(model_dir, filename)
-        download_file(file_url, file_path)
+        download_file(file_url, file_path, backup_url)
 
     # Load settings and params
     tf_ckpt_path = tf.train.latest_checkpoint(model_dir)

@@ -44,11 +46,9 @@ def download_and_load_gpt2(model_size, models_dir):
     return settings, params
 
 
-def download_file(url, destination):
-    # Send a GET request to download the file
-
-    try:
-        with urllib.request.urlopen(url) as response:
+def download_file(url, destination, backup_url=None):
+    def _attempt_download(download_url):
+        with urllib.request.urlopen(download_url) as response:
             # Get the total file size from headers, defaulting to 0 if not present
             file_size = int(response.headers.get("Content-Length", 0))
 

@@ -57,29 +57,44 @@ def download_file(url, destination):
                 file_size_local = os.path.getsize(destination)
                 if file_size == file_size_local:
                     print(f"File already exists and is up-to-date: {destination}")
-                    return
+                    return True  # Indicate success without re-downloading
 
-            # Define the block size for reading the file
             block_size = 1024  # 1 Kilobyte
 
             # Initialize the progress bar with total file size
-            progress_bar_description = os.path.basename(url)  # Extract filename from URL
+            progress_bar_description = os.path.basename(download_url)
             with tqdm(total=file_size, unit="iB", unit_scale=True, desc=progress_bar_description) as progress_bar:
-                # Open the destination file in binary write mode
                 with open(destination, "wb") as file:
-                    # Read the file in chunks and write to destination
                     while True:
                         chunk = response.read(block_size)
                         if not chunk:
                             break
                         file.write(chunk)
-                        progress_bar.update(len(chunk))  # Update progress bar
-    except urllib.error.HTTPError:
-        s = (
-            f"The specified URL ({url}) is incorrect, the internet connection cannot be established,"
-            "\nor the requested file is temporarily unavailable.\nPlease visit the following website"
-            " for help: https://github.com/rasbt/LLMs-from-scratch/discussions/273")
-        print(s)
+                        progress_bar.update(len(chunk))
+        return True
+
+    try:
+        if _attempt_download(url):
+            return
+    except (urllib.error.HTTPError, urllib.error.URLError):
+        if backup_url is not None:
+            print(f"Primary URL ({url}) failed. Attempting backup URL: {backup_url}")
+            try:
+                if _attempt_download(backup_url):
+                    return
+            except urllib.error.HTTPError:
+                pass
+
+        # If we reach here, both attempts have failed
+        error_message = (
+            f"Failed to download from both primary URL ({url})"
+            f"{' and backup URL (' + backup_url + ')' if backup_url else ''}."
+            "\nCheck your internet connection or the file availability.\n"
+            "For help, visit: https://github.com/rasbt/LLMs-from-scratch/discussions/273"
+        )
+        print(error_message)
+    except Exception as e:
+        print(f"An unexpected error occurred: {e}")
 
 
 # Alternative way using `requests`
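For orientation, this change keeps the public call site of the weight-download helper unchanged; only the fallback behavior inside download_file is new. The snippet below is an illustrative usage sketch, not part of the commit: it assumes gpt_download.py is importable as a module, and the "124M" model size and "gpt2" target directory are example values.

```python
# Illustrative usage sketch (not part of this commit).
# The backup URL is tried automatically if the primary download fails.
from gpt_download import download_and_load_gpt2  # assumes gpt_download.py is on the path

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")
print(settings)        # hyperparameters read from hparams.json
print(params.keys())   # pretrained GPT-2 weight arrays
```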

ch02/01_main-chapter-code/ch02.ipynb (+19, -5)

@@ -1788,7 +1788,10 @@
    ],
    "source": [
     "token_embeddings = token_embedding_layer(inputs)\n",
-    "print(token_embeddings.shape)"
+    "print(token_embeddings.shape)\n",
+    "\n",
+    "# uncomment & execute the following line to see how the embeddings look like\n",
+    "# print(token_embedding)"
    ]
   },
   {

@@ -1807,7 +1810,10 @@
    "outputs": [],
    "source": [
     "context_length = max_length\n",
-    "pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)"
+    "pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)\n",
+    "\n",
+    "# uncomment & execute the following line to see how the embedding layer weights look like\n",
+    "# print(pos_embedding_layer.weight)"
    ]
   },
   {

@@ -1826,7 +1832,10 @@
    ],
    "source": [
     "pos_embeddings = pos_embedding_layer(torch.arange(max_length))\n",
-    "print(pos_embeddings.shape)"
+    "print(pos_embeddings.shape)\n",
+    "\n",
+    "# uncomment & execute the following line to see how the embeddings look like\n",
+    "# print(pos_embeddings)"
    ]
   },
   {

@@ -1853,7 +1862,10 @@
    ],
    "source": [
     "input_embeddings = token_embeddings + pos_embeddings\n",
-    "print(input_embeddings.shape)"
+    "print(input_embeddings.shape)\n",
+    "\n",
+    "# uncomment & execute the following line to see how the embeddings look like\n",
+    "# print(input_embeddings)"
    ]
   },
   {

@@ -1888,7 +1900,9 @@
    "source": [
     "See the [./dataloader.ipynb](./dataloader.ipynb) code notebook, which is a concise version of the data loader that we implemented in this chapter and will need for training the GPT model in upcoming chapters.\n",
     "\n",
-    "See [./exercise-solutions.ipynb](./exercise-solutions.ipynb) for the exercise solutions."
+    "See [./exercise-solutions.ipynb](./exercise-solutions.ipynb) for the exercise solutions.\n",
+    "\n",
+    "See the [Byte Pair Encoding (BPE) Tokenizer From Scratch](../02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb) notebook if you are interested in learning how the GPT-2 tokenizer can be implemented and trained from scratch."
    ]
   }
  ],
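The notebook cells touched above add commented-out print statements so readers can inspect the embedding tensors themselves, not just their shapes. As a rough standalone sketch of what those tensors look like dimensionally (the random token IDs below stand in for the dataloader batch used in the notebook; vocab_size, output_dim, max_length, and the batch size of 8 match the chapter's values):

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, max_length, batch_size = 50257, 256, 4, 8

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(max_length, output_dim)

# Stand-in for the token-ID batch the notebook gets from its dataloader
inputs = torch.randint(0, vocab_size, (batch_size, max_length))

token_embeddings = token_embedding_layer(inputs)                # shape: [8, 4, 256]
pos_embeddings = pos_embedding_layer(torch.arange(max_length))  # shape: [4, 256]

# Positional embeddings broadcast over the batch dimension
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)                                   # torch.Size([8, 4, 256])
```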

ch02/01_main-chapter-code/dataloader.ipynb (+10, -7)

@@ -103,8 +103,8 @@
     "        return self.input_ids[idx], self.target_ids[idx]\n",
     "\n",
     "\n",
-    "def create_dataloader_v1(txt, batch_size=4, max_length=256, \n",
-    "                         stride=128, shuffle=True, drop_last=True, num_workers=0):\n",
+    "def create_dataloader_v1(txt, batch_size, max_length, stride,\n",
+    "                         shuffle=True, drop_last=True, num_workers=0):\n",
     "    # Initialize the tokenizer\n",
     "    tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
     "\n",

@@ -121,9 +121,6 @@
     "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
     "    raw_text = f.read()\n",
     "\n",
-    "tokenizer = tiktoken.get_encoding(\"gpt2\")\n",
-    "encoded_text = tokenizer.encode(raw_text)\n",
-    "\n",
     "vocab_size = 50257\n",
     "output_dim = 256\n",
     "context_length = 1024\n",

@@ -132,8 +129,14 @@
     "token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)\n",
     "pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)\n",
     "\n",
+    "batch_size = 8\n",
     "max_length = 4\n",
-    "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length)"
+    "dataloader = create_dataloader_v1(\n",
+    "    raw_text,\n",
+    "    batch_size=batch_size,\n",
+    "    max_length=max_length,\n",
+    "    stride=max_length\n",
+    ")"
    ]
   },
   {

@@ -189,7 +192,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.10.6"
+   "version": "3.11.4"
   }
  },
 "nbformat": 4,

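As a quick sanity check of the reworked cell, the dataloader built by create_dataloader_v1 yields (input, target) pairs of token-ID tensors, as the __getitem__ method above shows. The iteration below is an illustrative sketch, not part of the commit, and assumes the cell above has been executed:

```python
# Illustrative only: fetch one batch from the dataloader configured above
inputs, targets = next(iter(dataloader))
print(inputs.shape, targets.shape)  # expected: torch.Size([8, 4]) torch.Size([8, 4])
```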