A daily log of my journey learning and implementing deep learning and parallel computing concepts using CUDA (NVIDIA) and HIP/ROCm (AMD).
Learned how to add two 1D vectors using a basic CUDA kernel.
Keywords and Variables:
- __global__: Defines a function (kernel) that runs on the GPU, launched with <<<...>>>.
- blockIdx.x: Current block index within the grid.
- blockDim.x: Number of threads per block.
- threadIdx.x: Current thread index within the block.
Memory Management:
- cudaMalloc: Allocates memory on the GPU.
- cudaMemcpy: Copies memory between host (CPU) and device (GPU).
- cudaFree: Frees allocated GPU memory.
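A minimal sketch putting the pieces above together (hypothetical names, not the original code): allocate device buffers, copy the inputs over, launch one thread per element, and copy the result back.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expected 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```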
Learned to add two 2D matrices.
Keywords and Variables:
- dim3: CUDA type for specifying 1D, 2D, or 3D dimensions for grids and blocks.
- cudaDeviceSynchronize: Forces the CPU to wait for the GPU to finish.
Matrix Addition Formula:
c[i * N + j] = a[i * N + j] + b[i * N + j]
where:
i: row
j: column
N: matrix width
Other concepts are similar to Day 01.
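A short sketch of the indexing formula and dim3 launch described above, assuming a square N x N row-major matrix (illustrative names, not the original code):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel using the indexing formula above (N = matrix width).
__global__ void matAdd(const float *a, const float *b, float *c, int N) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i < N && j < N)
        c[i * N + j] = a[i * N + j] + b[i * N + j];
}

// Launch with 2D blocks and a 2D grid, then wait for the GPU to finish.
void launchMatAdd(const float *d_a, const float *d_b, float *d_c, int N) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matAdd<<<grid, block>>>(d_a, d_b, d_c, N);
    cudaDeviceSynchronize();
}
```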
Learned to multiply a 2D matrix with a 1D vector.
Example:
2D matrix:
1 1 1
1 1 1
1 1 1
1D vector:
2 2 2
Result:
6 6 6
No new concepts today.
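A minimal sketch of the matrix-vector kernel, assuming a square N x N row-major matrix with one thread per output row (illustrative, not the original code):

```cuda
// One thread per output row: out[i] = sum_j mat[i*N + j] * vec[j].
__global__ void matVecMul(const float *mat, const float *vec, float *out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float sum = 0.0f;
        for (int j = 0; j < N; ++j)
            sum += mat[i * N + j] * vec[j];
        out[i] = sum;
    }
}
// For the 3x3 all-ones matrix and the vector (2, 2, 2) above, each out[i] is 6.
```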
Explored shared memory in CUDA.
Keywords and Variables:
- __shared__: Declares shared memory accessible by all threads in a block.
- __syncthreads(): Synchronizes all threads in a block.
Notes:
- Static shared memory: __shared__ int var1[10];
- Dynamic shared memory: extern __shared__ int var1[];
Partial Sum Example (input in = 1, 2, ..., 16, using 8 threads):
sharedMemory[0] = in[0] + in[8] = 1 + 9 = 10
...
sharedMemory[7] = in[7] + in[15] = 8 + 16 = 24
Partial sums are then accumulated.
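A minimal sketch of the partial-sum kernel described above, assuming 16 inputs and a single block of 8 threads (illustrative, not the original code):

```cuda
#define N 16
#define THREADS (N / 2)

// Each of the 8 threads first folds two inputs into shared memory
// (sharedMem[t] = in[t] + in[t + 8]), then a tree reduction accumulates
// the partial sums down to a single value.
__global__ void partialSum(const int *in, int *out) {
    __shared__ int sharedMem[THREADS];       // static shared memory
    int t = threadIdx.x;
    sharedMem[t] = in[t] + in[t + THREADS];
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (t < stride) sharedMem[t] += sharedMem[t + stride];
        __syncthreads();
    }
    if (t == 0) *out = sharedMem[0];         // for in = 1..16 the result is 136
}
```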
Implemented Layer Normalization in CUDA.
Formula:

$$ \text{LayerNorm}(x_i) = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta $$

Where:
- $\mu$: Mean of $x$
- $\sigma^2$: Variance of $x$
- $\epsilon$: Small constant
- $\gamma$, $\beta$: Learnable parameters
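A deliberately naive sketch of the formula (one thread normalizes one row, with per-feature gamma and beta); the names and layout here are assumptions, not the original kernel:

```cuda
// Naive sketch: one thread normalizes one row of a (rows x cols) matrix.
__global__ void layerNorm(const float *x, float *y,
                          const float *gamma, const float *beta,
                          int rows, int cols, float eps) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float *xr = x + row * cols;

    // Mean of the row.
    float mean = 0.0f;
    for (int j = 0; j < cols; ++j) mean += xr[j];
    mean /= cols;

    // Variance of the row.
    float var = 0.0f;
    for (int j = 0; j < cols; ++j) {
        float d = xr[j] - mean;
        var += d * d;
    }
    var /= cols;

    // Normalize, scale, and shift.
    float inv_std = rsqrtf(var + eps);
    for (int j = 0; j < cols; ++j)
        y[row * cols + j] = (xr[j] - mean) * inv_std * gamma[j] + beta[j];
}
```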
Learned to transpose a matrix in CUDA.
Keywords:
- cudaError_t: CUDA error code type.
- cudaGetLastError(): Retrieves the last CUDA error.
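A short sketch combining a naive transpose with the error-checking pattern above (illustrative names, not the original code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Naive transpose: out is cols x rows when in is rows x cols.
__global__ void transposeKernel(const float *in, float *out, int rows, int cols) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        out[c * rows + r] = in[r * cols + c];
}

void launchTranspose(const float *d_in, float *d_out, int rows, int cols) {
    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    transposeKernel<<<grid, block>>>(d_in, d_out, rows, cols);

    // Check whether the launch itself failed (bad configuration, etc.).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
}
```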
Learned about tiled convolution in CUDA.
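A minimal sketch of the idea for the 1D case, assuming a constant-memory mask (filled via cudaMemcpyToSymbol) and blockDim.x == TILE_SIZE; this is illustrative, not the original kernel:

```cuda
#define MASK_WIDTH 5
#define TILE_SIZE 256

__constant__ float d_mask[MASK_WIDTH];

// 1D tiled convolution: each block loads its tile plus halo cells into shared
// memory, then every thread computes one output element from the cached data.
__global__ void conv1dTiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE_SIZE + MASK_WIDTH - 1];
    int halo = MASK_WIDTH / 2;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperatively load the shifted tile (including halos), zero-padding borders.
    for (int i = threadIdx.x; i < TILE_SIZE + MASK_WIDTH - 1; i += blockDim.x) {
        int src = blockIdx.x * blockDim.x + i - halo;
        tile[i] = (src >= 0 && src < n) ? in[src] : 0.0f;
    }
    __syncthreads();

    if (gid < n) {
        float acc = 0.0f;
        for (int k = 0; k < MASK_WIDTH; ++k)
            acc += tile[threadIdx.x + k] * d_mask[k];
        out[gid] = acc;
    }
}
```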
Learned the Brent-Kung algorithm for fast prefix sum.
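A single-block sketch of the Brent-Kung inclusive scan, assuming the input fits in one section of 2 * blockDim.x elements (illustrative, not the original code): an up-sweep builds a reduction tree, then a down-sweep distributes the partial sums.

```cuda
#define SECTION_SIZE 2048  // 2 * blockDim.x when launched with 1024 threads

__global__ void brentKungScan(const float *in, float *out, int n) {
    __shared__ float buf[SECTION_SIZE];
    int tid = threadIdx.x;
    if (tid < n)              buf[tid] = in[tid];
    if (tid + blockDim.x < n) buf[tid + blockDim.x] = in[tid + blockDim.x];

    // Up-sweep (reduction) phase.
    for (int stride = 1; stride <= blockDim.x; stride *= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < SECTION_SIZE) buf[idx] += buf[idx - stride];
    }
    // Down-sweep phase distributes the partial sums.
    for (int stride = SECTION_SIZE / 4; stride > 0; stride /= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx + stride < SECTION_SIZE) buf[idx + stride] += buf[idx];
    }
    __syncthreads();
    if (tid < n)              out[tid] = buf[tid];
    if (tid + blockDim.x < n) out[tid + blockDim.x] = buf[tid + blockDim.x];
}
```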
Implemented Flash Attention forward pass using CUDA and shared memory.
Implemented Flash Attention for higher-dimensional tensors.
Learned about sparse matrices, ELL and COO storage formats, and their real-world importance.
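A small host-side illustration of the two storage formats for a hypothetical 3 x 3 sparse matrix (the values are made up for the example):

```cuda
// Example: representing the sparse matrix
//   [ 3 0 1 ]
//   [ 0 0 0 ]
//   [ 0 2 4 ]
// COO stores one (row, col, value) triplet per nonzero.
int   coo_row[] = {0, 0, 2, 2};
int   coo_col[] = {0, 2, 1, 2};
float coo_val[] = {3, 1, 2, 4};

// ELL pads every row to the same number of nonzeros (here 2) and stores the
// column indices and values in dense arrays (padding marked with -1 / 0).
int   ell_col[3][2] = {{0, 2}, {-1, -1}, {1, 2}};
float ell_val[3][2] = {{3, 1}, { 0,  0}, {2, 4}};
```

In practice the ELL arrays are usually laid out column-major on the GPU so that threads in a warp read consecutive elements.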
Implemented parallel merge sort in CUDA.
Implemented parallel BFS and GELU activation using multiple threads.
Built a simple neural network with a linear layer.
Implemented a CNN from scratch using CUDA kernels.
Implemented FHD (Fully-Hybrid Domain) algorithm for non-cartesian MRI reconstruction in CUDA.
Learned and implemented FlashAttention-2 (forward and backward pass).
Implemented Naive Bayes with shared memory for prior and likelihood.
Used cuBLAS API for vector addition and matrix multiplication.
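A minimal sketch of the two cuBLAS calls, assuming the device buffers are already allocated and filled (link with -lcublas); cuBLAS treats matrices as column-major:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = alpha * x + y            (cublasSaxpy: vector addition with alpha = 1)
// C = alpha * A * B + beta * C (cublasSgemm: matrix multiplication)
// A is m x k, B is k x n, C is m x n, all column-major device pointers.
void cublasDemo(float *d_x, float *d_y, float *d_A, float *d_B, float *d_C,
                int n, int m, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);        // vector addition
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,          // matrix multiplication
                m, n, k, &alpha, d_A, m, d_B, k, &beta, d_C, m);

    cublasDestroy(handle);
}
```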
Explored cuDNN API for building fully connected networks and using built-in GEMM functions.
Implemented RoPE (Rotary Positional Encoding) in CUDA.
Implemented EM algorithm for 1D Gaussian vector clustering.
Implemented SwiGLU activation on 2D data.
Implemented atomicAdd in CUDA to count thread IDs.
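A minimal sketch of the pattern (here simply counting launched threads with an atomic counter; the original kernel may have accumulated thread IDs instead):

```cuda
// Every thread atomically bumps a global counter; without atomicAdd the
// read-modify-write would race and undercount.
__global__ void countThreads(int *counter) {
    atomicAdd(counter, 1);
}

// Usage sketch (illustrative launch shape):
// int *d_counter; cudaMalloc(&d_counter, sizeof(int));
// cudaMemset(d_counter, 0, sizeof(int));
// countThreads<<<64, 256>>>(d_counter);   // counter ends up at 64 * 256 = 16384
```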
Implemented Monte Carlo Tree Search in CUDA with 1024 parallel simulations.
Implemented histogram loss in parallel using shared memory.
Implemented mirror descent in CUDA with parallel threads.
Built a micrograd-like autograd engine in CUDA with parallel threads.
Learned to use CUDA Graphs for fast computation without changing kernels.
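A minimal sketch of stream capture with an unchanged kernel (addOne is a placeholder); note that the cudaGraphInstantiate signature differs between CUDA versions, as flagged in the comment:

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Capture a short launch sequence once, then replay it many times with a
// single cudaGraphLaunch per iteration; the kernels themselves are unchanged.
void runWithGraph(float *d_x, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t graphExec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);   // recorded, not executed
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);

    // CUDA 12 signature; CUDA 11 takes extra error-node / log-buffer arguments.
    cudaGraphInstantiate(&graphExec, graph, 0);

    for (int i = 0; i < 100; ++i)
        cudaGraphLaunch(graphExec, stream);                // replays the whole sequence
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```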
Implemented and experimented with deep learning operations and parallel computing using HIP for AMD GPUs. The project is organized into three folders:
DL/ — Deep Learning Operations
- conv_2d.cpp: HIP-based 2D convolution for CNNs.
- flash_attention_forward.cpp: Efficient attention mechanism.
- gelu.cpp: GELU activation function.
- layer_norm.cpp: Layer normalization kernel.
- rope_hip.cpp: Rotary positional encoding.
parallel/ — Matrix Operations with Parallelism
- matmul_rocblas.cpp: Matrix multiplication using rocBLAS.
- matrix_add.cpp: Parallel matrix addition.
- matrix_trans.cpp: Matrix transpose with shared memory.
- parallel_merge.cpp: Data merging with thread-level parallelism.
simple/ — Introductory Parallel Programs
- partial_sum.cpp: Basic reduction (sum).
- prefix_sum.cpp: Inclusive prefix sum (scan).
- vec_reocblas.cpp: Vector ops with rocBLAS.
- vector_add.cpp: Parallel vector addition.
- vector_matrix_mul.cpp: Vector-matrix multiplication.
Implemented Game of Life using shared memory in CUDA.
Implemented SGMM as an AMD HIP kernel.
Implemented MLP with ReLU (forward and backward).
Benchmarked CUDA vs CPU performance.
Implemented ray tracing using CUDA.
Implemented heat diffusion in HIP (AMD).
Implemented vector addition in Triton.
Implemented matrix multiplication in Triton.
Implemented softmax in Triton.
Implemented a fused matmul with ReLU.
Implemented Conv1D in Triton.
Implemented matmul with autotuning in Triton.
Implemented the LeetGPU attention kernel in CUDA.
Implemented Conv3D in CUDA.
Implemented Boids from LeetGPU in CUDA.
Implemented the Muon optimizer in CUDA with LibTorch integration.
Implemented multi-GPU parallel optimization using a Bee Colony metaheuristic to find the minimum of a simple sum-of-squares function.
Implemented Limited-Memory BFGS (L-BFGS) in CUDA.
- Reference: https://outoftardis.github.io/daal/daal/algorithms/optimization-solvers/solvers/lbfgs.html
Implemented the Conjugate Gradient Method (CG) in CUDA.
Implemented matmul in FP16 and FP8 in CUDA.
Implemented an LSTM in CUDA.
Implemented an RNN and a GRU in CUDA.
Implemented AdaHessian in CUDA.
Implemented a bidirectional LSTM in CUDA.
Implemented DDPM in CUDA (failed attempt).
Implemented DDPM in CUDA and PyTorch.

Implemented the Mish activation function in CUDA.
Implemented the wavelet transform in Triton.
Implemented LayerNorm in Triton. :(
Implemented Bitonic Sort in CUDA.
Implemented Simulated Annealing in Triton.
Implemented Spectral Norm in CUDA with shared memory.
Implemented Group Norm in CUDA with shared memory.
Implemented KL Divergence in CUDA.
Implemented GeGLU in CUDA with forward and backward passes.
Implemented SwiGLU in CUDA with forward and backward passes.
Implemented a Poisson solver in CUDA.
Implemented a LoRA linear layer in CUDA.
Implemented the K-means algorithm in CUDA.
Implemented the Total Variation (TV) distance in CUDA.
Implemented JSD (Jensen-Shannon Divergence) in CUDA with forward and backward passes, including loss calculation.
Implemented dyt, a simple vector derivative, in CUDA.
Implemented M-RoPE from the Qwen2-VL paper in CUDA with forward and backward passes.
Implemented a fused linear softmax in CUDA with loss calculation.
Implemented Contrastive Loss in CUDA.
Implemented Triplet Loss in CUDA.
Implemented upper triangular matrix multiplication in CUDA.
Implemented Huber Loss in CUDA.
Implemented a linear layer with Swish activation in CUDA.
Implemented 3D average pooling in CUDA.
Implemented SoftPlus in CUDA.
Implemented Negative Cosine Similarity in CUDA.
Implemented MinReduce for multi-dimensional data in CUDA.
Implemented tensor matrix multiplication for higher-dimensional tensors in CUDA.
Implemented Hard Sigmoid in CUDA.
Implemented MSE in CUDA.
Implemented symMatmul in CUDA.
Implemented lower triangular matrix multiplication in CUDA.
Implemented hinge loss in CUDA.
Implemented Conv1D in CUDA.
Implemented RMS norm in CUDA.
Implemented a minimal Transformer in CUDA.
Implemented 2D max pooling in CUDA.
Implemented Product Over Dimension in CUDA.
Implemented ELU for FP16 in CUDA.
Implemented Leaky ReLU in CUDA using float4 vectors.
Implemented GEMM with shared memory and a fused ReLU activation in CUDA.
Implemented Kullback-Leibler Divergence in CUDA.
Implemented a fused Mixture of Experts (MoE) in CUDA using the cuBLAS API to accelerate inference.
Implemented a minimal version of Stable Diffusion using DDPM and DDIM models for a fast forward pass.