A simple gravitational N-body simulation in fewer than 100 lines of C code, with CUDA and Xeon Phi (MIC) optimizations.
Five benchmarks are provided for the CUDA and MIC platforms:
- nbody-orig : the original, unoptimized simulation (also for CPU)
- nbody-soa : Conversion from array of structures (AOS) data layout to structure of arrays (SOA) data layout
- nbody-flush : Flush denormals to zero (no code changes, just a command line option)
- nbody-block : Cache blocking
- nbody-unroll / nbody-align : platform-specific final optimizations (loop unrolling in CUDA, data alignment on MIC)
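The nbody-flush step needs no source changes, but it helps to see what a denormal is. The sketch below is only an illustration in software: is_denormal and flush_to_zero are hypothetical helpers, not part of the benchmarks, which rely on a compiler option instead.

```c
#include <float.h>
#include <math.h>

/* Returns 1 if x is a denormal (subnormal) float: nonzero, but smaller in
   magnitude than FLT_MIN, the smallest normalized float. */
static int is_denormal(float x) {
    return fpclassify(x) == FP_SUBNORMAL;
}

/* Software model of flush-to-zero: denormals become a zero of the same sign,
   every other value passes through unchanged. Hypothetical helper for
   illustration only. */
static float flush_to_zero(float x) {
    return is_denormal(x) ? copysignf(0.0f, x) : x;
}
```

For example, FLT_MIN / 2.0f is a denormal and would be flushed to zero, while FLT_MIN itself is left untouched.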
nbody.c : simple, unoptimized OpenMP C code
timer.h : simple cross-OS timing code
Each directory below includes scripts for building and running a "shmoo" of five successive optimizations of the code over a range of data sizes from 1024 to 524,288 bodies.
cuda/ : folder containing CUDA optimized versions of the original C code (in order of performance on Tesla K20c GPU)
- nbody-orig.cu : a straight port of the code to CUDA (shmoo-cuda-nbody-orig.sh)
- nbody-soa.cu : conversion to structure of arrays (SOA) data layout (shmoo-cuda-nbody-soa.sh)
- nbody-soa.cu + ftz : flush denormals to zero enabled (shmoo-cuda-nbody-ftz.sh)
- nbody-block.cu : cache blocking in CUDA shared memory (shmoo-cuda-nbody-block.sh)
- nbody-unroll.cu : addition of "#pragma unroll" to inner loop (shmoo-cuda-nbody-unroll.sh)
mic/ : folder containing Intel Xeon Phi (MIC) optimized versions of the original C code (in order of performance on Xeon Phi 7110P)
- ../nbody.c : original code (shmoo-mic-nbody-orig.sh)
- nbody-soa.c : conversion to structure of arrays (SOA) data layout (shmoo-mic-nbody-soa.sh)
- nbody-soa.c + ftz : flush denormals to zero enabled (shmoo-mic-nbody-ftz.sh)
- nbody-block.c : cache blocking via loop splitting (shmoo-mic-nbody-block.sh)
- nbody-align.c : aligned memory allocation and vector access (shmoo-mic-nbody-align.sh)