⚡🚀 100 Days of GPU Challenge 🎉🖥️

Welcome to my 100 Days of GPU journey!
This repository is a public log of my learning, experiments, and projects as I dive deep into the world of:

  • GPU architecture
  • CUDA programming
  • Memory hierarchies
  • Parallelism
  • Acceleration for deep learning & scientific computing

🎯 The goal: Develop a strong theoretical and practical understanding of GPU-based high-performance computing.


🧑‍🏫 Mentor & Inspiration

🌐 GPU Programming Platforms

📓 My Kaggle Notebooks


Progress Table


| Day | 📋 Topic | 🎯 Key Learning Areas | 💻 Implementation |
| --- | --- | --- | --- |
| 001 | 🖥️ CPU vs. GPU Architectures & Parallelism | • Processor trends and Moore's Law<br>• Latency- vs. throughput-oriented design<br>• GPU evolution from graphics to general-purpose computing<br>• Limits of parallelization | View on GitHub |
| 002 | 🏗️ GPU Architecture Fundamentals | • GPU architecture evolution (Fermi → Ampere)<br>• Streaming Multiprocessors (SMs)<br>• Warp execution and scheduling<br>• Memory hierarchy (shared, L1, L2)<br>• Tensor Cores for matrix acceleration | View on GitHub |
| 003 | Vector Addition | • Data vs. task parallelism<br>• CUDA memory management (`cudaMalloc`, `cudaMemcpy`)<br>• Kernel fundamentals and thread indexing<br>• Error checking and synchronization<br>• Function qualifiers (`__global__`, `__device__`)<br>• See the vector-addition sketch after this table | View on GitHub |
| 004 | 🌈 Multidimensional Grids and Data | • 2D/3D thread organization<br>• Multidimensional indexing techniques<br>• Mapping 2D coordinates to linear memory<br>• Boundary-condition handling | View on GitHub |
| 005 | 🖼️ Image Blur Processing & Performance | • 3×3 average-filter implementation<br>• Memory transfer vs. computation analysis<br>• Performance optimization strategies<br>• Shared-memory patterns | View on GitHub |
| 006 | 🔢 CUDA Programming Model & Matrix Multiplication | • Scalable parallelism hierarchy<br>• Grids, blocks, and thread block clusters<br>• Asynchronous SIMT programming<br>• Compute capability features<br>• Memory management techniques | View on GitHub |
| 007 | 🧠 L2 Cache and Shared Memory | • L2 cache control and architecture<br>• Memory access pattern optimization<br>• Hit-ratio strategies<br>• L2 cache reset options<br>• Set-aside memory layout | View on GitHub |
| 008 | 🚄 Memory Transfer Performance | • Memory types (pageable, pinned, unified)<br>• Transfer bandwidth analysis<br>• PCIe efficiency optimization<br>• Async transfer techniques<br>• Batching and memory pools | View on GitHub |
| 009 | 🔄 Page-Locked Memory & Thread Coarsening | • Benefits of page-locked host memory<br>• Portable and write-combining memory<br>• Mapped memory techniques<br>• Thread-coarsening optimization strategies | View on GitHub |
| 010 | 🔄 Memory Synchronization Domains | • Handling memory-fence interference<br>• Traffic isolation with domains<br>• Using domains in CUDA<br>• Introduction to Triton programming | View on GitHub |
| 011 | ⚙️ Asynchronous & Concurrent Execution | • Vector Hadamard product<br>• Concurrent execution between host and device<br>• Concurrent kernel execution<br>• Overlapping data transfers with kernel execution<br>• Concurrent data transfers<br>• CUDA streams and stream synchronization<br>• Host functions (callbacks)<br>• Stream priorities<br>• Programmatic dependent launch<br>• See the stream-overlap sketch after this table | View on GitHub |
| 012 | 🖥️ Multi-Device Systems | • Device enumeration and selection<br>• Stream and event behavior<br>• Peer-to-peer memory access and copy<br>• Unified virtual address space<br>• Interprocess communication (IPC)<br>• Error checking in CUDA | View on GitHub |
| 013 | 🔄 CUDA Versioning & Compatibility | • CUDA version compatibility rules<br>• Mixing driver and runtime versions<br>• Compute-mode settings and switching<br>• Compatibility modes<br>• Naive softmax implementation | View on GitHub |
| 014 | ⚙️ Shared-Memory Softmax | • Softmax using shared memory in CUDA (see the softmax sketch after this table)<br>• Hardware implementation<br>• SIMT architecture<br>• Hardware multithreading | View on GitHub |
| 015 | 🚀 Vectorized Softmax with Shared Memory & Optimizations | • Softmax using vectorized memory access (`float4`) and shared memory<br>• Thread synchronization strategy<br>• CUDA occupancy APIs and concurrent kernel execution<br>• Latency hiding and resource impact on occupancy | View on GitHub |
| 016 | 💾 Maximizing Memory Throughput | • Softmax with coalesced memory access<br>• Minimized host–device data transfers<br>• Balanced global vs. shared memory usage<br>• Shared-memory access patterns | View on GitHub |
| 017 | 🚀 Maximizing Instruction Throughput | • 1D convolution<br>• Minimized low-throughput instructions<br>• Reduced divergent warps<br>• Lowered total instruction count | View on GitHub |
| 018 | 💡 Performance Considerations | • Partial sum with reduction<br>• Improved global memory bandwidth usage<br>• Dynamic SM resource partitioning<br>• Data prefetching and instruction-mix tuning<br>• Thread-granularity tuning | View on GitHub |
| 019 | 🧠 OS Introduction & Process Abstraction | • Improved partial sum using bitwise operations<br>• OS abstractions | View on GitHub |
| 020 | 🔧 Warp Shuffling & System Calls | • Partial sum using warp shuffles with atomic reduction (see the warp-shuffle sketch after this table)<br>• System calls for process management | View on GitHub |
| 021 | 🤝 Cooperative Groups | • Partial sum using thread-level cooperative groups<br>• Coalesced groups and atomic aggregation | View on GitHub |
| 022 | 📚 Scheduling & IPC | • Grid-level cooperative group for the partial sum<br>• Scheduling policies and IPC mechanisms | View on GitHub |
| 023 | 🔄 Pipelining & Memory Patterns | • Pipelined vector addition<br>• Optimized memory access patterns<br>• Moving from synchronous to asynchronous design<br>• Debugging memory-management issues | View on GitHub |
| 024 | 🧮 Matrix Multiplication (Cooperative Groups) | • Matrix multiplication using block-level cooperative groups | View on GitHub |
| 025 | Async Data Movement & Paging | • Matrix multiplication with asynchronous data movement<br>• Paging and demand paging | View on GitHub |
| 026 | 🧹 Memory Management | • Memory allocation and free-space handling<br>• Optimized vector addition implementation | View on GitHub |
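Many of the early entries build on the same vector-addition scaffolding, so here is a minimal, self-contained sketch of the Day 003 material: one thread per element, explicit `cudaMalloc`/`cudaMemcpy`, and systematic error checking. It is illustrative only; `vecAdd` and the `CHECK` macro are my names, not necessarily the repo's.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error (Day 003: error checking).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// One thread per element; the bounds check handles n not divisible
// by the block size (Day 003: kernel fundamentals, thread indexing).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    CHECK(cudaMalloc(&d_c, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));

    printf("c[0] = %f\n", h_c[0]);  // expect 3.0
    CHECK(cudaFree(d_a)); CHECK(cudaFree(d_b)); CHECK(cudaFree(d_c));
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```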
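Day 011's overlap of transfers and kernels is easiest to see with chunked work issued round-robin across two streams. A minimal sketch, assuming pinned host memory (which `cudaMemcpyAsync` needs to be truly asynchronous); the `scale` kernel and the chunk count are arbitrary placeholders, not the repo's code.

```cuda
#include <cuda_runtime.h>

// Chunked vector scaling with two streams, so H2D copies, kernel work,
// and D2H copies of different chunks can overlap (Days 008 and 011).
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory: required
    cudaMalloc(&d, n * sizeof(float));       // for truly async transfers
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];  // alternate chunks between streams
        size_t off = (size_t)c * chunk, bytes = chunk * sizeof(float);
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

With a single stream the three operations per chunk would serialize; with two streams the copy engines and the SMs can work on different chunks at the same time.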
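Day 014's shared-memory softmax pattern: one block per row, with tree reductions in shared memory for the row max (numerical stability) and for the exp-sum. A sketch assuming a power-of-two block size; `softmaxRow` is a hypothetical name, not the repo's actual kernel.

```cuda
#include <cfloat>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One block per row; sdata holds the tree reduction for the row max
// and then the exp-sum. Requires blockDim.x to be a power of two.
__global__ void softmaxRow(const float *in, float *out, int cols) {
    extern __shared__ float sdata[];
    const float *row = in + (size_t)blockIdx.x * cols;
    float *orow = out + (size_t)blockIdx.x * cols;
    int tid = threadIdx.x;

    // 1) Row max, strided over columns, then tree-reduced in shared memory.
    float m = -FLT_MAX;
    for (int c = tid; c < cols; c += blockDim.x) m = fmaxf(m, row[c]);
    sdata[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    m = sdata[0];
    __syncthreads();  // everyone has read the max before sdata is reused

    // 2) Sum of exp(x - max), same reduction shape.
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x) sum += expf(row[c] - m);
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    sum = sdata[0];

    // 3) Normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        orow[c] = expf(row[c] - m) / sum;
}

int main() {
    const int rows = 4, cols = 1000, block = 256;
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < rows * cols; ++i) h[i] = (float)(i % 7);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    softmaxRow<<<rows, block, block * sizeof(float)>>>(d_in, d_out, cols);
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    float s = 0.0f;
    for (int c = 0; c < cols; ++c) s += h[c];
    printf("row 0 sums to %f (expect 1.0)\n", s);
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```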
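Day 020's warp-shuffle reduction replaces shared-memory traffic with register-to-register exchanges inside each warp, then issues one `atomicAdd` per warp instead of one per thread. Again a sketch under my own naming (`warpReduceSum`, `partialSum`), not the repo's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Reduce 32 lane values to lane 0 using __shfl_down_sync; no shared
// memory needed within a warp (Day 020: warp shuffling).
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // full sum is valid in lane 0
}

__global__ void partialSum(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;  // no early return: keep warps converged
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)       // one atomic per warp, not per thread
        atomicAdd(out, v);
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    float *h_in = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    partialSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expect %d)\n", h_out, n);

    cudaFree(d_in); cudaFree(d_out); free(h_in);
    return 0;
}
```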


