High-Performance DNN Optimization with TVM

Project Overview

This project demonstrates the optimization of fundamental Deep Neural Network (DNN) operations using Apache TVM, a deep learning compiler stack for CPUs, GPUs, and specialized accelerators. TVM decouples the computation specification (what to compute) from the optimization schedule (how to compute it), enabling hardware-specific optimization while preserving algorithmic correctness.
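To make the decoupling concrete, here is a minimal sketch using TVM's te API on a toy element-wise add (illustrative only, not code from this repository): the compute definition fixes the math, and the schedule alone decides how it executes.

import tvm
from tvm import te

# Compute definition: *what* to calculate (a toy element-wise add)
n = 1024
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Schedule: *how* to calculate it; changing it never changes the result
s = te.create_schedule(C.op)
xo, xi = s[C].split(C.op.axis[0], factor=8)
s[C].parallel(xo)   # outer chunks across CPU threads
s[C].vectorize(xi)  # inner chunk as SIMD lanes

fadd = tvm.build(s, [A, B, C], target="llvm")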

Implemented Operations

This repository contains optimized implementations for the following DNN primitives across multiple hardware targets:

CPU Optimizations

  • 1D Convolution: Vectorization, loop tiling, and cache-aware blocking strategies
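A rough sketch of how these three techniques appear in a te schedule for the 1D convolution (illustrative sizes and split factors, not the repository's actual ops.py code):

import tvm
from tvm import te

n, kw = 4096, 3
A = te.placeholder((n + kw - 1,), name="A")  # input, assumed pre-padded
W = te.placeholder((kw,), name="W")          # filter taps
rk = te.reduce_axis((0, kw), name="rk")
B = te.compute((n,), lambda i: te.sum(A[i + rk] * W[rk], axis=rk), name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=64)  # cache-aware blocking of outputs
xi, vi = s[B].split(xi, factor=8)             # 8-wide chunks for SIMD
s[B].reorder(xo, rk, xi, vi)                  # loop tiling and reordering for locality
s[B].vectorize(vi)                            # vectorize the innermost axis
s[B].parallel(xo)                             # distribute blocks across CPU threads
f = tvm.build(s, [A, W, B], target="llvm")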

GPU Optimizations

  • 1D Convolution: Thread binding, shared memory utilization, and cooperative fetching (see the sketch after this list)
  • General Matrix Multiplication (GEMM): Tiled computation with multi-level memory hierarchy optimization
  • 2D Depthwise Separable Convolution: Spatial locality optimization and thread coarsening
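The GPU pattern for the 1D convolution item above, as a rough sketch (thread binding, a shared-memory stage, and cooperative fetching; illustrative factors, assuming a CUDA device, not the repository's exact schedule):

import tvm
from tvm import te

n, kw = 4096, 3
A = te.placeholder((n + kw - 1,), name="A")
W = te.placeholder((kw,), name="W")
rk = te.reduce_axis((0, kw), name="rk")
B = te.compute((n,), lambda i: te.sum(A[i + rk] * W[rk], axis=rk), name="B")

s = te.create_schedule(B.op)
AA = s.cache_read(A, "shared", [B])            # stage the input tile in shared memory

nthreads = 128
bx, tx = s[B].split(B.op.axis[0], factor=nthreads)
s[B].bind(bx, te.thread_axis("blockIdx.x"))    # one block per output tile
s[B].bind(tx, te.thread_axis("threadIdx.x"))   # one thread per output element

s[AA].compute_at(s[B], bx)                     # load each tile once per block
fo, fi = s[AA].split(AA.op.axis[0], factor=nthreads)
s[AA].bind(fi, te.thread_axis("threadIdx.x"))  # cooperative fetching: threads load the tile together

f = tvm.build(s, [A, W, B], target="cuda")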

Key Optimization Techniques

The implementations showcase various performance optimization strategies:

  • Memory Hierarchy Optimization: Efficient use of shared memory, local caches, and register files
  • Loop Transformations: Tiling, unrolling, and reordering for improved data locality (combined in the GEMM sketch after this list)
  • Parallelization: Strategic thread binding and block configuration for GPU execution
  • Vectorization: SIMD instruction utilization for CPU implementations
  • Compute Scheduling: Optimal operation ordering and data movement minimization
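Several of these techniques compose naturally on a single operator. The following matmul schedule, in the style of TVM's public CPU GEMM tutorial (factors are illustrative, not tuned), combines tiling, reduction splitting, reordering, unrolling, and vectorization:

import tvm
from tvm import te

M = N = K = 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
bn = 32
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)  # loop tiling
(kaxis,) = s[C].op.reduce_axis
ko, ki = s[C].split(kaxis, factor=4)                            # split the reduction
s[C].reorder(io, jo, ko, ki, ii, ji)                            # reorder for locality
s[C].unroll(ki)                                                 # unroll the short loop
s[C].vectorize(ji)                                              # SIMD on the innermost axis
s[C].parallel(io)                                               # threads on the outer axis
f = tvm.build(s, [A, B, C], target="llvm")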

Performance Validation

Each implementation includes comprehensive testing that validates:

  • Functional Correctness: Numerical agreement with reference NumPy/PyTorch implementations, within floating-point tolerance (see the sketch after this list)
  • Performance Benchmarks: Execution time measurements and speedup analysis
  • Cross-Platform Compatibility: Consistent behavior across different hardware targets
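A sketch of such a check for the 1D convolution (default schedule for brevity; the comparison is tolerance-based rather than strictly bit-exact, since reordered float reductions can differ in low-order bits):

import numpy as np
import tvm
from tvm import te

n, kw = 4096, 3
A = te.placeholder((n + kw - 1,), name="A")
W = te.placeholder((kw,), name="W")
rk = te.reduce_axis((0, kw), name="rk")
B = te.compute((n,), lambda i: te.sum(A[i + rk] * W[rk], axis=rk), name="B")
f = tvm.build(te.create_schedule(B.op), [A, W, B], target="llvm")

dev = tvm.cpu(0)
a_np = np.random.rand(n + kw - 1).astype("float32")
w_np = np.random.rand(kw).astype("float32")
a, w = tvm.nd.array(a_np, dev), tvm.nd.array(w_np, dev)
b = tvm.nd.empty((n,), "float32", dev)
f(a, w, b)

# Functional correctness: cross-correlation in NumPy is the reference
ref = np.correlate(a_np, w_np, mode="valid")
np.testing.assert_allclose(b.numpy(), ref, rtol=1e-5)

# Performance benchmark: mean wall time over repeated runs
timer = f.time_evaluator(f.entry_name, dev, number=100)
print("mean time: %.3e s" % timer(a, w, b).mean)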

Repository Structure

src/
├── ops.py          # Core optimization implementations
└── __init__.py

notebooks/
├── 1-conv1d_cpu.ipynb     # Interactive CPU optimization
├── 2-conv1d_gpu.ipynb     # Interactive GPU optimization
├── 3-conv1d_fpga.ipynb    # Interactive FPGA optimization
├── 4-gemm_gpu.ipynb       # Interactive GEMM optimization
└── 5-conv2d_dw_gpu.ipynb  # Interactive 2D convolution optimization

Getting Started

Prerequisites

  • Apache TVM (tlcpack nightly builds recommended; install command below)
  • CUDA toolkit (for GPU implementations)
  • Python 3.7+
  • NumPy, PyTorch (for reference implementations)
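A typical TVM installation is a one-liner (the exact wheel name varies with your CUDA version; see tlcpack.ai for the current index):

pip install tlcpack-nightly -f https://tlcpack.ai/wheels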

Interactive Development

Launch the Jupyter notebooks for interactive optimization and analysis:

jupyter notebook notebooks/

Performance Results

The optimized implementations achieve significant speedups over baseline implementations:

  • CPU 1D Convolution: Up to 8x speedup through vectorization and blocking
  • GPU GEMM: Competitive performance with cuBLAS through advanced tiling strategies
  • GPU 2D Convolution: Optimized memory access patterns for improved throughput

Technical Approach

This project demonstrates modern compiler optimization techniques for machine learning workloads, emphasizing:

  • Hardware-aware optimization: Tailoring algorithms to specific architectural features
  • Systematic performance engineering: Iterative optimization with measurable improvements
  • Cross-platform portability: Unified optimization framework across diverse hardware
