Welcome to LLMs from Scratch, an all-killer-no-filler curriculum that takes you from tokenization to alignment with meticulously crafted Jupyter notebooks, actionable theory, and production-ready code. Whether you are a researcher, engineer, or curious builder, this course gives you the scaffolding to demystify modern LLMs and deploy your own.
- Hands-on notebooks for every lesson—clone locally or launch instantly in Lightning Studio.
- Practical checkpoints and datasets so you can experiment without babysitting boilerplate.
- Theory, references, and best practices interwoven with code so every concept sticks.
- Production-aware workflow covering training, scaling, alignment, quantization, and deployment-friendly fine-tuning.
Each module is a standalone notebook packed with explanations, exercises, and implementation details. View them on GitHub, launch them via GitHub Pages, or open them interactively in Lightning Studio.
| Module | Topic | Notebook |
|---|---|---|
| 01 | Tokenization Foundations | 01-tokenization.ipynb |
| 02 | Building a Tiny LLM | 02-tinyllm.ipynb |
| 03 | Advancing Our LLM | 03-advancing-our-llm.ipynb |
| 04 | Data Engineering for LLMs | 04-data.ipynb |
| 05 | Scaling Laws in Practice | 05-scaling-laws.ipynb |
| 06 | Pretraining at Scale | 06-pretraining.ipynb |
| 07 | Supervised Fine-Tuning | 07-supervised-finetuning.ipynb |
| 08 | RLHF and Alignment | 08-rlhf-alignment.ipynb |
| 09 | LoRA & RLVR Techniques | 09-lora-rlvr.ipynb |
| 10 | Pruning & Distillation | 10-pruning-distillation.ipynb |
| 11 | Appendix: Position Embeddings | 11-appendix-position-embeddings.ipynb |
| 12 | Appendix: Quantisation Strategies | 12-appendix-quantisation.ipynb |
| 13 | Appendix: Parameter-Efficient Tuning | 13-appendix-peft.ipynb |
| 14 | Bonus: Energy Based and Diffusion LLMs | 14-bonus-diffusion-llms.ipynb |
| 15 | Bonus: State Space Models | 15-bonus-state-space-models.ipynb |
- The end-to-end data flow of an LLM—from tokenization and batching to inference-time decoding (see the sketch after this list).
- How to implement core transformer components, attention variations, and optimization tricks.
- Strategies for scaling datasets, managing checkpoints, and monitoring training stability.
- Practical alignment techniques: SFT, preference modeling, RLHF, and reward modeling.
- Deployment-ready compression: pruning, distillation, quantization, and PEFT recipes.
- Bonus sections on Energy-Based Models (EBMs), Diffusion LLMs, and State Space Models (SSMs).
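To make the first bullet concrete, here is a minimal, self-contained sketch of that flow using a toy character-level tokenizer and a stand-in model. Every name here (`ToyLM`, the toy vocabulary) is illustrative only; the notebooks build the real components step by step.

```python
# Illustrative sketch of tokenize -> batch -> forward -> decode.
# ToyLM is a placeholder; the actual transformer is built in Modules 01-03.
import torch
import torch.nn as nn

text = "hello world"
vocab = sorted(set(text))                        # toy character-level vocabulary
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for ch, i in stoi.items()}

ids = torch.tensor([stoi[c] for c in text])      # tokenization: text -> token ids
x = ids[:-1].unsqueeze(0)                        # batching: inputs ...
y = ids[1:].unsqueeze(0)                         # ... and next-token targets

class ToyLM(nn.Module):
    def __init__(self, vocab_size, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, idx):
        return self.head(self.embed(idx))        # (batch, seq, vocab) logits

model = ToyLM(len(vocab))
logits = model(x)
loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)), y.view(-1))

next_id = logits[0, -1].argmax().item()          # inference-time decoding: greedy next token
print(f"loss={loss.item():.3f}, next char={itos[next_id]!r}")
```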
- Click the Open in Studio badge above.
- Authenticate with Lightning (or create a free account).
- Explore the notebooks in a fully provisioned environment with GPU options.
- The Studio comes with all model checkpoints saved, and you can test them with the code provided in `test-model.ipynb` (a hypothetical sketch of the loading step is shown below).
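If you just want a feel for what that involves, the snippet below is a hypothetical illustration of loading a saved PyTorch checkpoint; the actual code lives in `test-model.ipynb`, and `checkpoints/tinyllm.pt` is only a placeholder path.

```python
# Hypothetical illustration only: the real loading code is in test-model.ipynb,
# and checkpoints/tinyllm.pt is a placeholder path, not a guaranteed repo location.
import torch

state_dict = torch.load("checkpoints/tinyllm.pt", map_location="cpu")
print({name: tuple(p.shape) for name, p in state_dict.items()})  # inspect parameter shapes

# model.load_state_dict(state_dict)  # restore into the matching architecture, then generate
```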
- Clone the repository
  ```bash
  git clone https://github.com/shreshthtuli/llms-from-scratch.git
  cd llms-from-scratch
  ```
- Install dependencies (recommended: Python 3.10+)
  ```bash
  pip install uv
  uv sync
  ```
- Add API keys in a `.env` file. Follow `.env.example`.
- Launch Jupyter
  ```bash
  jupyter lab
  ```
- Open any notebook to start experimenting.
Need data? Check the `data/` directory and follow the dataset preparation steps inside each notebook.
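As a rough orientation, preparing a plain-text corpus for next-token prediction often looks like the sketch below. The file name `data/corpus.txt`, the character-level tokenizer, and the output path are assumptions for illustration, not the notebooks' actual recipe.

```python
# Assumed layout for illustration: data/corpus.txt is a plain-text file.
# Each notebook documents its own (possibly different) preparation steps.
from pathlib import Path
import torch

block_size = 128                                    # context length per training example

text = Path("data/corpus.txt").read_text(encoding="utf-8")
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
ids = torch.tensor([stoi[c] for c in text], dtype=torch.long)

# Slice the token stream into (input, next-token target) pairs of length block_size.
n_blocks = (len(ids) - 1) // block_size
inputs = ids[: n_blocks * block_size].view(n_blocks, block_size)
targets = ids[1 : n_blocks * block_size + 1].view(n_blocks, block_size)

torch.save({"inputs": inputs, "targets": targets, "vocab": vocab}, "data/train.pt")
```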
- Foundations (Modules 01–03) – Understand tokens, build your first transformer, and iterate on architecture improvements.
- Data & Scaling (Modules 04–06) – Curate corpora, tune training loops, and scale pretraining experiments responsibly.
- Alignment (Modules 07–09) – Apply SFT, RLHF, and efficient adaptation techniques to align your model with human intent.
- Optimization (Modules 10–15) – Compress, fine-tune, and deploy models using state-of-the-art efficiency tricks.
- Capstone – Combine your learnings to train, align, and ship a bespoke LLM tailored to your use case.
Mix and match as needed—every notebook is designed to stand on its own, but following this order unlocks the smoothest learning curve.
- Lightning Studio: Run the entire repo in the cloud with zero setup using the badge above.
- GitHub Codespaces: Launch a dev container directly from the repo for quick edits.
- Local GPUs / Clusters: Scripts in `src/` support distributed and mixed-precision training out of the box (a representative mixed-precision step is sketched below).
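For orientation, the core of a mixed-precision training step in PyTorch typically looks like the sketch below. The model, batch, and hyperparameters are placeholders rather than the repo's actual `src/` code, and a distributed run would additionally wrap the model (for example with DistributedDataParallel under `torchrun`).

```python
# Representative mixed-precision (AMP) step; placeholders only, not the repo's src/ scripts.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 512).to(device)               # stand-in for the real LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(8, 512, device=device)               # dummy batch
target = torch.randn(8, 512, device=device)

with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = nn.functional.mse_loss(model(x), target)   # forward pass runs in fp16 where safe

scaler.scale(loss).backward()                         # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)                                # unscale gradients, then take the step
scaler.update()
optimizer.zero_grad(set_to_none=True)
```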
I’m Shreshth Tuli—researcher, builder, and educator focused on making advanced ML systems approachable. I’ve shipped production LLMs, authored peer-reviewed papers, and taught hundreds of practitioners how to wield these models responsibly. Expect honest takes, transparent trade-offs, and plenty of real-world war stories.
Connect with me on LinkedIn.
Contributions, bug reports, and suggestions are warmly welcomed! To contribute:
- Fork the repo and create a feature branch.
- Open a PR describing your changes and the motivation behind them.
- Tag any relevant notebooks or scripts and include screenshots/metrics if applicable.
Check the issue tracker for bite-sized tasks or open a discussion if you want to propose new modules.
This project is open-sourced under the Apache 2.0 License. Feel free to use the materials for your own learning, workshops, or derivative courses—just keep attribution intact.
The best way to learn LLMs is to build one. 🚀