Inspired by Sebastian Raschka's book, "Build a Large Language Model (From Scratch)," this repository provides a practical demonstration of building LLMs from the ground up. It covers key aspects of the Transformer architecture and the intricacies involved in building and training your own LLMs.
Basic knowledge of Python, Machine Learning, Neural Networks, and Large Language Models is required.
Covers the fundamentals of tokenization. A tokenizer is a component that splits text into smaller units (tokens). This file downloads "the-verdict.txt", reads its words, and prepares a vocabulary. It demonstrates how to create tokens from simple sentences and large texts by splitting on delimiter characters such as spaces.
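A minimal sketch of that splitting step (the regex and sample text here are illustrative, not necessarily the file's exact code):

```python
import re

text = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace, keeping the delimiters
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
# Drop empty strings and pure-whitespace entries
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# A vocabulary maps each unique token to an integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
```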
Implements the SimpleTokenizerV1 class with two methods: encode and decode. The encode method splits the input text into tokens and returns their token IDs (integers). The decode method converts a list of token IDs back into the original text.
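A simplified sketch of what such a class can look like (not the file's exact implementation):

```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() inserts before punctuation
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
```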
Implements SimpleTokenizerV2, which is more capable than SimpleTokenizerV1. It handles two additional special tokens: <|unk|> for words missing from the vocabulary and <|endoftext|> as a separator between unrelated texts.
Uses tiktoken to create a byte pair encoding (BPE) tokenizer and shows how it is more capable than the previously created SimpleTokenizerV1 and SimpleTokenizerV2.
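A short example of the tiktoken BPE tokenizer in use (the sample text is illustrative):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)                    # e.g. [15496, 11, 466, ...]
print(tokenizer.decode(ids))  # round-trips back to the original text
```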
Implements GPTDatasetV1, which accepts the text used for dataset creation, a tokenizer, and a maximum length for each chunk. It creates a dataset of input and target tensor chunks, where each target chunk is the corresponding input chunk shifted by one token.
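A sketch of a sliding-window dataset in this spirit; the stride parameter and the exact constructor signature are assumptions, not necessarily the repo's code:

```python
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(txt)
        self.input_ids, self.target_ids = [], []
        # Slide a window over the token IDs; the target is the input shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Usage
tokenizer = tiktoken.get_encoding("gpt2")
dataset = GPTDatasetV1("Some long training text ... " * 50, tokenizer, max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```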
A quick walkthrough of creating embeddings from simple vector data using PyTorch.
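For reference, a PyTorch embedding layer is simply a trainable lookup table indexed by token ID:

```python
import torch

torch.manual_seed(123)
# 6 possible IDs, each mapped to a 3-dimensional vector
embedding_layer = torch.nn.Embedding(num_embeddings=6, embedding_dim=3)
print(embedding_layer.weight)                         # the 6 x 3 lookup table
print(embedding_layer(torch.tensor([2, 3, 5, 1])))    # rows 2, 3, 5, 1 of that table
```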
Creates a dataset by reading the "the-verdict.txt" file with create_dataloader_v1 from e_data_preparation.py, then creates token embeddings and positional embeddings.
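A condensed sketch of the embedding step; here the batch of token IDs is hard-coded instead of coming from the dataloader, and the dimensions are illustrative:

```python
import torch

vocab_size, emb_dim, context_length = 50257, 256, 4
token_ids = torch.tensor([[40, 367, 2885, 1464],    # a batch of two token-ID sequences
                          [1807, 3619, 402, 271]])

tok_emb_layer = torch.nn.Embedding(vocab_size, emb_dim)
pos_emb_layer = torch.nn.Embedding(context_length, emb_dim)

token_embeddings = tok_emb_layer(token_ids)                    # (2, 4, 256)
pos_embeddings = pos_emb_layer(torch.arange(context_length))   # (4, 256)
input_embeddings = token_embeddings + pos_embeddings           # broadcast over the batch
print(input_embeddings.shape)                                  # torch.Size([2, 4, 256])
```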
This file will be downloaded as part of code execution in later files. Please ignore this file for now.
Defines a PyTorch tensor (matrix) representing 6 input elements (rows), each with a 3-dimensional feature vector; each row stands for the embedding of one word. Takes the second element as the "query" and calculates attention scores, attention weights, and the context vector step by step.
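A condensed sketch of that computation (the input values are illustrative):

```python
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],   # Your
     [0.55, 0.87, 0.66],   # journey
     [0.57, 0.85, 0.64],   # starts
     [0.22, 0.58, 0.33],   # with
     [0.77, 0.25, 0.10],   # one
     [0.05, 0.80, 0.55]])  # step

query = inputs[1]                                  # the second element is the query
attn_scores = inputs @ query                       # dot product of the query with every row
attn_weights = torch.softmax(attn_scores, dim=0)   # normalize so the weights sum to 1
context_vec = attn_weights @ inputs                # weighted sum of all input rows
print(attn_weights, context_vec)
```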
Implements simple self-attention using PyTorch in a compact form.
Implements self-attention using linear layers. Performs the following steps: a) calculates attention weights, b) applies masking, c) applies normalization, d) applies negative infinity masking, e) recalculates attention weights, f) applies dropout.
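A condensed sketch of the final form of those steps; the intermediate mask-then-renormalize variant is skipped here, and the dimensions are illustrative:

```python
import torch

torch.manual_seed(123)
d_in, d_out, num_tokens = 3, 2, 6
x = torch.rand(num_tokens, d_in)                    # 6 token embeddings

W_query = torch.nn.Linear(d_in, d_out, bias=False)
W_key   = torch.nn.Linear(d_in, d_out, bias=False)
W_value = torch.nn.Linear(d_in, d_out, bias=False)

queries, keys, values = W_query(x), W_key(x), W_value(x)
attn_scores = queries @ keys.T                      # (6, 6) raw scores

# Causal mask: positions above the diagonal are future tokens and must be hidden
mask = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
masked_scores = attn_scores.masked_fill(mask, float("-inf"))

# Scaled softmax turns the -inf entries into exactly zero weight
attn_weights = torch.softmax(masked_scores / keys.shape[-1] ** 0.5, dim=-1)

dropout = torch.nn.Dropout(0.5)                     # randomly zeroes weights during training
attn_weights = dropout(attn_weights)
context_vecs = attn_weights @ values                # (6, 2)
print(context_vecs.shape)
```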
Implements the CausalAttention class. The steps (a-f) performed in j_linear_self_attention.py are consolidated into a single self-attention class built from PyTorch modules.
Implements the MultiHeadAttentionWrapper class. It takes a num_heads value and adds that many CausalAttention instances to a ModuleList; essentially, MultiHeadAttentionWrapper is a collection of CausalAttention modules determined by num_heads.
Implements an efficient multi-head attention class called MultiHeadAttention.
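A sketch of such a class, which projects queries, keys, and values once and then splits them across heads (simplified, not necessarily identical to the file):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out, self.num_heads = d_out, num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project once, then split the last dimension into (num_heads, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        attn_scores = q @ k.transpose(2, 3)                      # (b, heads, T, T)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        context = (attn_weights @ v).transpose(1, 2).reshape(b, num_tokens, self.d_out)
        return self.out_proj(context)

# Usage
mha = MultiHeadAttention(d_in=3, d_out=4, context_length=6, dropout=0.0, num_heads=2)
print(mha(torch.rand(2, 6, 3)).shape)  # torch.Size([2, 6, 4])
```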
Implements DummyGPTModel class, which serves as the basic skeleton of a GPT model.
Initializes the DummyGPTModel created in n_dummy_gpt_model.py and generates output by calling this model with simple inputs.
Implements and explains the LayerNorm, GELU, and FeedForward classes.
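Simplified sketches of the three components (initialization details may differ from the file):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalizes the last dimension to zero mean and unit variance, then scales and shifts."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

class GELU(nn.Module):
    """Tanh approximation of the Gaussian Error Linear Unit activation."""
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x ** 3)))

class FeedForward(nn.Module):
    """Two linear layers with a GELU in between; expands to 4x the embedding size."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), GELU(), nn.Linear(4 * emb_dim, emb_dim))

    def forward(self, x):
        return self.layers(x)

print(FeedForward(emb_dim=8)(torch.rand(2, 4, 8)).shape)  # torch.Size([2, 4, 8])
```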
Implements ExampleDeepNeuralNetwork to illustrate a simple deep neural network. Such networks are important building blocks of the Transformer architecture.
Implements the TransformerBlock class using MultiHeadAttention from m_efficient_multi_head_attention.py, FeedForward, LayerNorm from p_layernorm_gelu_feedforward.py, and Dropout.
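A sketch of the same structure, but using PyTorch's built-in nn.LayerNorm, nn.GELU, and nn.MultiheadAttention in place of the repo's custom MultiHeadAttention, LayerNorm, and FeedForward classes, purely to keep the example self-contained:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm Transformer block: attention and feed-forward, each with a residual connection."""
    def __init__(self, emb_dim, num_heads, context_length, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.att = nn.MultiheadAttention(emb_dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(), nn.Linear(4 * emb_dim, emb_dim))
        self.drop = nn.Dropout(dropout)
        # Causal mask: True marks positions a token is not allowed to attend to
        self.register_buffer("mask", torch.triu(
            torch.ones(context_length, context_length, dtype=torch.bool), diagonal=1))

    def forward(self, x):
        T = x.shape[1]
        # Attention sub-layer with residual (shortcut) connection
        h = self.norm1(x)
        attn_out, _ = self.att(h, h, h, attn_mask=self.mask[:T, :T], need_weights=False)
        x = x + self.drop(attn_out)
        # Feed-forward sub-layer with residual connection
        x = x + self.drop(self.ff(self.norm2(x)))
        return x

print(TransformerBlock(emb_dim=16, num_heads=4, context_length=8)(torch.rand(2, 8, 16)).shape)
```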
Implements the complete GPTModel class, assembling the components built in the previous files.
Pretrains the GPTModel with very basic data using a couple of input and target examples. This provides a basic understanding of large language model pretraining.
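To illustrate the core of pretraining (next-token prediction with a cross-entropy loss), here is a sketch that uses a tiny stand-in model instead of the full GPTModel; the token IDs and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 50257, 64

# A tiny stand-in for GPTModel (embedding + output head, no Transformer blocks),
# used only so the loss computation below runs on its own.
model = nn.Sequential(nn.Embedding(vocab_size, emb_dim), nn.Linear(emb_dim, vocab_size))

inputs = torch.tensor([[16833, 3626, 6100],     # token IDs for a couple of input examples
                       [40,    1107, 588]])
targets = torch.tensor([[3626,  6100, 345],     # the same sequences shifted left by one token
                        [1107,  588,  11311]])

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

logits = model(inputs)                                    # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(logits.flatten(0, 1),  # next-token prediction loss
                                   targets.flatten())
loss.backward()                                           # backpropagate
optimizer.step()                                          # update the weights
optimizer.zero_grad()
print(loss.item())   # roughly ln(50257) = 10.8 for an untrained model
```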
Pretrains the GPTModel on a proper dataset created from the text of the "the-verdict.txt" file.
Downloads weights from a pretrained GPT-2 model and loads them into our GPTModel. You can choose from the following model sizes based on the available GPU memory and compute:
- gpt2-small (124M parameters)
- gpt2-medium (355M parameters)
- gpt2-large (774M parameters)
- gpt2-xl (1558M parameters)
Fine-tunes the GPTModel for classification tasks such as spam/not-spam detection.
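The usual recipe is to freeze the pretrained weights, replace the vocabulary-sized output head with a small classification head, and train on the last token's output. A sketch using a hypothetical stand-in model (TinyLM is not part of the repo):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):                         # hypothetical stand-in for GPTModel
    def __init__(self, vocab_size=50257, emb_dim=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size)

    def forward(self, x):
        return self.out_head(self.tok_emb(x))    # (batch, seq_len, out_features)

model = TinyLM()
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained weights
model.out_head = nn.Linear(64, 2)                # new trainable head: spam / not spam

inputs = torch.randint(0, 50257, (4, 10))        # a batch of 4 tokenized messages
labels = torch.tensor([1, 0, 0, 1])              # 1 = spam, 0 = not spam

logits = model(inputs)[:, -1, :]                 # use the last token's logits only
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                  # gradients flow only into the new head
print(loss.item())
```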
Fine-tunes the GPTModel with instruction data to create an instruction-following model. The goal is to train the model to follow user instructions effectively.
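Instruction examples are typically rendered into a prompt template before tokenization. A sketch assuming an Alpaca-style format with instruction/input/output fields (the exact template and field names used by the repo may differ):

```python
def format_instruction(entry):
    """Formats one training example in an Alpaca-style prompt layout (one common convention)."""
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry.get("input") else ""
    response_text = f"\n\n### Response:\n{entry['output']}"
    return instruction_text + input_text + response_text

example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}
print(format_instruction(example))
```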
Returns to basics and reviews fundamental neural network concepts.
Fine-tunes the model using Low-Rank Adaptation (LoRA) technique. LoRA is a parameter-efficient fine-tuning method that freezes the original LLM weights and introduces a small number of trainable rank-decomposition matrices to adapt the model to specific tasks, significantly reducing training costs and computational requirements.
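A minimal sketch of the idea: a frozen Linear layer plus a trainable low-rank path (class and parameter names here are illustrative):

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    """Low-rank update: instead of training W directly, learn A (d_in x r) and B (r x d_out)."""
    def __init__(self, d_in, d_out, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, d_out))    # zero init: no change at the start
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    """Wraps a frozen Linear layer and adds the trainable low-rank path to its output."""
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                        # freeze the original weights
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)

# Usage: wrap an existing projection layer
layer = LinearWithLoRA(nn.Linear(64, 64))
print(layer(torch.rand(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```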
