⚡🚀 100 Days of GPU Challenge 🎉🖥️

Welcome to my 100 Days of GPU journey!
This repository is a public log of my learning, experiments, and projects as I dive deep into the world of:

  • GPU architecture
  • CUDA programming
  • Memory hierarchies
  • Parallelism
  • Acceleration for deep learning & scientific computing

🎯 The goal: Develop a strong theoretical and practical understanding of GPU-based high-performance computing.


🧑‍🏫 Mentor & Inspiration

🌐 GPU Programming Platforms

📓 My Kaggle Notebooks


Progress Table


| Day | 📋 Topic | 🎯 Key Learning Areas | 💻 Implementation |
| --- | --- | --- | --- |
| 001 | 🖥️ CPU vs. GPU Architectures & Parallelism | • Processor trends and Moore's Law<br>• Latency- vs. throughput-oriented design<br>• GPU evolution from graphics to general-purpose computing<br>• Limits of parallelization | View on GitHub |
| 002 | 🏗️ GPU Architecture Fundamentals | • GPU architecture evolution (Fermi → Ampere)<br>• Streaming Multiprocessors (SMs)<br>• Warp execution and scheduling<br>• Memory hierarchy (shared, L1, L2)<br>• Tensor Cores for matrix acceleration | View on GitHub |
| 003 | Vector Addition | • Data vs. task parallelism<br>• CUDA memory management (`cudaMalloc`, `cudaMemcpy`)<br>• Kernel fundamentals and thread indexing<br>• Error checking and synchronization<br>• Function qualifiers (`__global__`, `__device__`)<br>• See the vector-addition sketch after this table | View on GitHub |
| 004 | 🌈 Multidimensional Grids and Data | • 2D/3D thread organization<br>• Multidimensional indexing techniques<br>• Mapping 2D coordinates to linear memory<br>• Boundary-condition handling | View on GitHub |
| 005 | 🖼️ Image Blur Processing & Performance | • 3×3 average-filter implementation<br>• Memory transfer vs. computation analysis<br>• Performance optimization strategies<br>• Shared-memory patterns | View on GitHub |
| 006 | 🔢 CUDA Programming Model & Matrix Multiplication | • Scalable parallelism hierarchy<br>• Grids, blocks, and thread block clusters<br>• Asynchronous SIMT programming<br>• Compute capability features<br>• Memory management techniques | View on GitHub |
| 007 | 🧠 L2 Cache and Shared Memory | • L2 cache control and architecture<br>• Memory access pattern optimization<br>• Hit-ratio strategies<br>• L2 cache reset options<br>• Set-aside memory layout | View on GitHub |
| 008 | 🚄 Memory Transfer Performance | • Memory types (pageable, pinned, unified)<br>• Transfer bandwidth analysis<br>• PCIe efficiency optimization<br>• Async transfer techniques<br>• Batching and memory pools | View on GitHub |
| 009 | 🔄 Page-Locked Memory & Thread Coarsening | • Benefits of page-locked host memory<br>• Portable and write-combining memory<br>• Mapped memory techniques<br>• Thread-coarsening optimization strategies | View on GitHub |
| 010 | 🔄 Memory Synchronization Domains | • Handling memory-fence interference<br>• Traffic isolation with domains<br>• Using domains in CUDA<br>• Introduction to Triton programming | View on GitHub |
| 011 | ⚙️ Asynchronous & Concurrent Execution | • Vector Hadamard product<br>• Concurrent execution between host and device<br>• Concurrent kernel execution<br>• Overlapping data transfers with kernel execution<br>• Concurrent data transfers<br>• CUDA streams and stream synchronization<br>• Host functions (callbacks)<br>• Stream priorities<br>• Programmatic dependent launch<br>• See the stream-overlap sketch after this table | View on GitHub |
| 012 | 🖥️ Multi-Device Systems | • Device enumeration and selection<br>• Stream and event behavior<br>• Peer-to-peer memory access and copy<br>• Unified virtual address space<br>• Interprocess communication (IPC)<br>• Error checking in CUDA | View on GitHub |
| 013 | 🔄 CUDA Versioning & Compatibility | • CUDA version compatibility rules<br>• Mixing driver and runtime versions<br>• Compute-mode settings and switching<br>• Compatibility modes<br>• Naive softmax implementation | View on GitHub |
| 014 | ⚙️ Shared-Memory Softmax | • Softmax using shared memory in CUDA (see the softmax sketch after this table)<br>• Hardware implementation<br>• SIMT architecture<br>• Hardware multithreading | View on GitHub |
| 015 | 🚀 Vectorized Softmax with Shared Memory & Optimizations | • Softmax using vectorized memory access (`float4`) and shared memory<br>• Thread synchronization strategy<br>• CUDA occupancy APIs and concurrent kernel execution<br>• Latency hiding and resource impact on occupancy | View on GitHub |
| 016 | 💾 Maximizing Memory Throughput | • Softmax with coalesced memory access<br>• Minimized host–device data transfers<br>• Balanced global vs. shared memory usage<br>• Shared-memory access patterns | View on GitHub |
| 017 | 🚀 Maximizing Instruction Throughput | • 1D convolution<br>• Minimized low-throughput instructions<br>• Reduced divergent warps<br>• Lowered total instruction count | View on GitHub |
| 018 | 💡 Performance Considerations | • Partial sum with reduction<br>• Improved global memory bandwidth usage<br>• Dynamic SM resource partitioning<br>• Data prefetching and instruction-mix tuning<br>• Thread-granularity tuning | View on GitHub |
| 019 | 🧠 OS Introduction & Process Abstraction | • Improved partial sum using bitwise operations<br>• OS abstractions | View on GitHub |
| 020 | 🔧 Warp Shuffling & System Calls | • Partial sum using warp shuffles with atomic reduction (see the warp-shuffle sketch after this table)<br>• System calls for process management | View on GitHub |
| 021 | 🤝 Cooperative Groups | • Partial sum using thread-level cooperative groups<br>• Coalesced groups and atomic aggregation | View on GitHub |
| 022 | 📚 Scheduling & IPC | • Grid-level cooperative group for the partial sum<br>• Scheduling policies and IPC mechanisms | View on GitHub |
| 023 | 🔄 Pipelining & Memory Patterns | • Pipelined vector addition<br>• Optimized memory access patterns<br>• Moving from synchronous to asynchronous design<br>• Debugging memory-management issues | View on GitHub |
| 024 | 🧮 Matrix Multiplication (Cooperative Groups) | • Matrix multiplication using block-level cooperative groups | View on GitHub |
| 025 | Async Data Movement & Paging | • Matrix multiplication with asynchronous data movement<br>• Paging and demand paging | View on GitHub |
| 026 | 🧹 Memory Management | • Memory allocation and free-space handling<br>• Optimized vector addition implementation | View on GitHub |
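Many of the early entries build on the same vector-addition scaffolding, so here is a minimal, self-contained sketch of the Day 003 material: one thread per element, explicit `cudaMalloc`/`cudaMemcpy`, and systematic error checking. It is illustrative only; `vecAdd` and the `CHECK` macro are my names, not necessarily the repo's.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error (Day 003: error checking).
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// One thread per element; the bounds check handles n not divisible
// by the block size (Day 003: kernel fundamentals, thread indexing).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    CHECK(cudaMalloc(&d_c, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

    int block = 256, grid = (n + block - 1) / block;
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));

    printf("c[0] = %f\n", h_c[0]);  // expect 3.0
    CHECK(cudaFree(d_a)); CHECK(cudaFree(d_b)); CHECK(cudaFree(d_c));
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```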
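Day 011's overlap of transfers and kernels is easiest to see with chunked work issued round-robin across two streams. A minimal sketch, assuming pinned host memory (which `cudaMemcpyAsync` needs to be truly asynchronous); the `scale` kernel and the chunk count are arbitrary placeholders, not the repo's code.

```cuda
#include <cuda_runtime.h>

// Chunked vector scaling with two streams, so H2D copies, kernel work,
// and D2H copies of different chunks can overlap (Days 008 and 011).
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;

    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory: required
    cudaMalloc(&d, n * sizeof(float));       // for truly async transfers
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t st = s[c % 2];  // alternate chunks between streams
        size_t off = (size_t)c * chunk, bytes = chunk * sizeof(float);
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();  // wait for both streams to drain

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```

With a single stream the three operations per chunk would serialize; with two streams the copy engines and the SMs can work on different chunks at the same time.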
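Day 014's shared-memory softmax pattern: one block per row, with tree reductions in shared memory for the row max (numerical stability) and for the exp-sum. A sketch assuming a power-of-two block size; `softmaxRow` is a hypothetical name, not the repo's actual kernel.

```cuda
#include <cfloat>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One block per row; sdata holds the tree reduction for the row max
// and then the exp-sum. Requires blockDim.x to be a power of two.
__global__ void softmaxRow(const float *in, float *out, int cols) {
    extern __shared__ float sdata[];
    const float *row = in + (size_t)blockIdx.x * cols;
    float *orow = out + (size_t)blockIdx.x * cols;
    int tid = threadIdx.x;

    // 1) Row max, strided over columns, then tree-reduced in shared memory.
    float m = -FLT_MAX;
    for (int c = tid; c < cols; c += blockDim.x) m = fmaxf(m, row[c]);
    sdata[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    m = sdata[0];
    __syncthreads();  // everyone has read the max before sdata is reused

    // 2) Sum of exp(x - max), same reduction shape.
    float sum = 0.0f;
    for (int c = tid; c < cols; c += blockDim.x) sum += expf(row[c] - m);
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    sum = sdata[0];

    // 3) Normalize.
    for (int c = tid; c < cols; c += blockDim.x)
        orow[c] = expf(row[c] - m) / sum;
}

int main() {
    const int rows = 4, cols = 1000, block = 256;
    size_t bytes = (size_t)rows * cols * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < rows * cols; ++i) h[i] = (float)(i % 7);

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);
    softmaxRow<<<rows, block, block * sizeof(float)>>>(d_in, d_out, cols);
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    float s = 0.0f;
    for (int c = 0; c < cols; ++c) s += h[c];
    printf("row 0 sums to %f (expect 1.0)\n", s);
    cudaFree(d_in); cudaFree(d_out); free(h);
    return 0;
}
```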
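Day 020's warp-shuffle reduction replaces shared-memory traffic with register-to-register exchanges inside each warp, then issues one `atomicAdd` per warp instead of one per thread. Again a sketch under my own naming (`warpReduceSum`, `partialSum`), not the repo's code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Reduce 32 lane values to lane 0 using __shfl_down_sync; no shared
// memory needed within a warp (Day 020: warp shuffling).
__inline__ __device__ float warpReduceSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;  // full sum is valid in lane 0
}

__global__ void partialSum(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;  // no early return: keep warps converged
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)       // one atomic per warp, not per thread
        atomicAdd(out, v);
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_out, 0, sizeof(float));

    float *h_in = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    partialSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %.0f (expect %d)\n", h_out, n);

    cudaFree(d_in); cudaFree(d_out); free(h_in);
    return 0;
}
```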


