TriPIM is an extension of the TriCORE approach using UPMEM PIM concepts. TriCORE introduced an innovative technique for triangle counting in graph analytics, using a binary-search-driven mechanism to improve thread parallelism and memory efficiency. In this project, we present TriPIM, which builds on the foundational principles of TriCORE and integrates them with the UPMEM PIM technology. This integration aims to further optimize triangle counting by combining the strengths of the TriCORE approach with the capabilities offered by UPMEM.
This document provides instructions on how to build and run the various components of the project using the provided Makefile. The Makefile simplifies the compilation and execution process for CPU and GPU targets, as well as for DPU (DRAM Processing Unit) targets.
Before building and running the benchmarks, ensure you have the following installed:
- GNU Compiler Collection (GCC) for C++ compilation
- NVIDIA CUDA Toolkit for GPU code compilation
- Python3 for running GPU benchmarks
- UPMEM DPU Toolchain for compiling and executing DPU benchmarks
Make sure that the g++ and nvcc compilers are accessible in your system's PATH. Additionally, the DPU toolchain must be properly configured if you intend to run the DPU benchmarks.
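As a quick sanity check of the PATH requirement above, the tools can be probed with a short Python script. The DPU compiler name used here (`dpu-upmem-dpurte-clang`) is an assumption and may differ between UPMEM SDK releases.

```python
#!/usr/bin/env python3
# Report whether each required tool is resolvable on PATH.
# The DPU compiler name below is an assumption; check your UPMEM SDK docs.
import shutil

TOOLS = ["g++", "nvcc", "python3", "dpu-upmem-dpurte-clang"]

def check_tools(tools=TOOLS):
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

if __name__ == "__main__":
    for tool, path in check_tools().items():
        print(f"{tool}: {path if path else 'NOT FOUND'}")
```

Running it before `make all` makes missing-toolchain build failures easier to diagnose.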
The Makefile includes several targets to facilitate building, running, and managing the project components:
- `all`: Compiles all benchmarks, including GAP benchmark suite components and the TriPIM benchmark for CPU, GPU, and DPU platforms. Run with `make all`.
- `clean`: Removes all build artifacts, including binaries and intermediate files, from the `bin` and `lib` directories, along with Python cache files. Run with `make clean`.
- `help`: Lists all available Makefile commands along with a brief description of each. Run with `make help`.
- `tc_cpu`: Compiles the Triangle Counting (TC) CPU version of the GAP benchmark suite. Run with `make tc_cpu`.
- `converter`: Builds the graph format converter utility, part of the GAP benchmark suite. Run with `make converter`.
- `tc_upmem`: Builds the host side and DPU task of the TriPIM benchmark. Run with `make tc_upmem`.
- `run-tc_cpu`: Executes the CPU version of the TriPIM benchmark with predefined input parameters. Run with `make run-tc_cpu`.
- `run-tc_upmem`: Simulates the TriPIM benchmark on the host system, ideal for DPU functional simulation. Run with `make run-tc_upmem`.
The Makefile provides several targets for building and running specific benchmarks:
- `make run-tc_cpu`: Compiles and runs the `tc_cpu` benchmark (CPU-based)
- `make run-tc_upmem`: Compiles and runs the `tc_upmem` benchmark (DPU-based)
- `make run-%`: Runs a specified GAP benchmark (replace `%` with `tc_cpu` or `tc_upmem`)
Each benchmark offers various flags for customization. Refer to the specific benchmark's help message for details:
All of the binaries use the same command-line options for loading graphs:
- `-g 20` generates a Kronecker graph with 2^20 vertices (Graph500 specifications)
- `-u 20` generates a uniform random graph with 2^20 vertices (degree 16)
- `-f graph.el` loads graph from file graph.el
- `-sf graph.el` symmetrizes graph loaded from file graph.el
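To illustrate what the `-sf` option does conceptually, here is a minimal Python sketch of edge-list symmetrization (a hypothetical helper, not the GAP builder's actual code):

```python
# Minimal sketch of edge-list symmetrization (what "-sf" does conceptually):
# for every directed edge (u, v), ensure the reverse edge (v, u) is present.
# Illustration only; the real logic lives in the GAP graph builder.

def symmetrize(edges):
    """Return a deduplicated, sorted edge list with both directions of each edge."""
    sym = set()
    for u, v in edges:
        if u != v:          # drop self-loops in this sketch
            sym.add((u, v))
            sym.add((v, u))
    return sorted(sym)

print(symmetrize([(0, 1), (1, 2), (1, 0)]))
```

The duplicate pair (0, 1)/(1, 0) collapses to one undirected edge stored in both directions.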
The graph loading infrastructure understands the following formats:
- `.el` plain-text edge list with an edge per line as "node1 node2"
- `.wel` plain-text weighted edge list with an edge per line as "node1 node2 weight"
- `.gr` 9th DIMACS Implementation Challenge format
- `.graph` Metis format (used in the 10th DIMACS Implementation Challenge)
- `.mtx` Matrix Market format
- `.sg` serialized pre-built graph (use `converter` to make)
- `.wsg` weighted serialized pre-built graph (use `converter` to make)
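The `.el` and `.wel` layouts above are simple enough to parse in a few lines. The following Python sketch is illustrative only and is not the project's actual loader:

```python
# Minimal reader for the ".el" / ".wel" plain-text formats:
# one edge per line, "node1 node2" (.el) or "node1 node2 weight" (.wel).
# Illustrative sketch; the real loader is part of the GAP builder.

def read_edge_list(lines):
    """Parse edge-list lines into (u, v) or (u, v, weight) tuples."""
    edges = []
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue                      # skip blank lines and comments
        u, v = int(parts[0]), int(parts[1])
        if len(parts) >= 3:               # ".wel": third column is the weight
            edges.append((u, v, float(parts[2])))
        else:                             # ".el": unweighted
            edges.append((u, v))
    return edges

print(read_edge_list(["0 1", "1 2 0.5"]))
```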
The Makefile defines various targets for managing the build process, cleaning, and running benchmarks. Here's a summary of some key targets:
- `all`: Builds all targets (including GAP benchmarks)
- `clean`: Removes all build artifacts
- `clean-all`: Removes build artifacts and results directories
- `scrub-all`: Performs a more extensive cleanup (including backups)
- `run-%`: Runs a specified GAP benchmark
- `help`: Displays a list of available make commands and their descriptions
- `help-%`: Provides help for a specific benchmark
Several environment variables and Makefile settings control the build process:
- `CXX`: C++ compiler (default: g++)
- `UPMEM_NR_TASKLETS`: Number of UPMEM tasklets (default: 16)
- `UPMEM_NR_DPUS`: Number of DPUs (default: 1)
- `UPMEM_PROBLEM_SIZE`: Problem size (default: 2)
- `CXXFLAGS_GAP`: Compiler flags for GAP benchmarks
- `UPMEM_HOST_FLAGS`: Compiler flags for UPMEM host code
- `UPMEM_DPU_FLAGS`: Compiler flags for UPMEM DPU code
For more detailed information about each command and how to use the benchmarks, refer to the help command (make help) or the individual benchmark documentation provided within the project.
- `bin/`: Contains compiled executables for the GAP benchmark suite and the TriPIM CPU benchmark.
- `lib/`: Contains the shared library for the TriPIM GPU benchmark.
- `src/`: Contains source code for the project, including the GAP benchmark suite and the TriPIM benchmark.
GAP Benchmark Suite is designed to be a portable high-performance baseline that only requires a compiler with support for C++11. It uses OpenMP for parallelism, but it can be compiled without OpenMP to run serially. The details of the benchmark can be found in the specification.
The GAP Benchmark Suite is intended to help graph processing research by standardizing evaluations. Fewer differences between graph processing evaluations will make it easier to compare different research efforts and quantify improvements. The benchmark not only specifies graph kernels, input graphs, and evaluation methodologies, but it also provides an optimized baseline implementation (this repo). These baseline implementations are representative of state-of-the-art performance, and thus new contributions should outperform them to demonstrate an improvement.
TRICORE is a GPU-optimized triangle counting system distinguished by three core techniques:
- Binary Search Algorithm: Designed to bolster thread parallelism and memory efficiency on GPUs, addressing shortcomings of earlier GPU triangle counting approaches.
- Graph Representation Streamlining: Unlike previous methods that demanded various graph representations (like CSR, edge list, and bitmap) in the GPU memory, TRICORE uniquely distributes partitioned CSR data among GPUs. Additionally, it employs a streaming buffer, allowing edge lists to be fetched directly from CPU memory. This strategy empowers TRICORE to handle graphs substantially larger than typical GPU memory capacities.
- Dynamic Workload Management: Crafted to ensure a balanced GPU workload distribution.
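The binary-search technique can be sketched as follows: orient each undirected edge once to avoid double counting, then test candidate third vertices by binary-searching a sorted adjacency list. This plain-Python stand-in only illustrates the idea behind the GPU kernel; it is not TRICORE's implementation.

```python
# Sketch of binary-search-based triangle counting in the spirit of TRICORE.
from bisect import bisect_left
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as an edge list."""
    # Orient each edge from lower to higher vertex id so every
    # triangle is counted exactly once.
    adj = defaultdict(list)
    for u, v in edges:
        a, b = min(u, v), max(u, v)
        if a != b and b not in adj[a]:
            adj[a].append(b)
    for nbrs in adj.values():
        nbrs.sort()                       # binary search needs sorted lists

    def contains(sorted_list, x):
        i = bisect_left(sorted_list, x)   # the binary-search step
        return i < len(sorted_list) and sorted_list[i] == x

    count = 0
    for u, nbrs in adj.items():
        for v in nbrs:                    # oriented edge (u, v)
            for w in nbrs:
                if w > v and contains(adj.get(v, []), w):
                    count += 1            # triangle u-v-w found
    return count

# A 4-clique contains exactly 4 triangles.
print(count_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))
```

On a GPU, the inner `contains` lookups are what TRICORE parallelizes across threads; the sketch above keeps only the algorithmic skeleton.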
Performance Insights:
- TRICORE processed the billion-edge Twitter graph in just 24 seconds on a single GPU, 22 times faster than leading CPU-based methods even when those CPUs cost 8 times more.
- For expansive graphs (up to 33.4 billion edges) that dwarf a single GPU's memory by about 22 times, TRICORE achieves a 24-fold performance increase as the system scales from 1 to 32 GPUs.
TriPIM is based on PrIM, the first benchmark suite for a real-world processing-in-memory (PIM) architecture. PrIM was developed to evaluate, analyze, and characterize the first publicly available real-world PIM architecture, the UPMEM PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.
PrIM provides a common set of workloads for evaluating the UPMEM PIM architecture and can be useful to programming, architecture, and systems researchers alike for improving multiple aspects of future PIM hardware and software. The workloads have different characteristics, exhibiting heterogeneity in their memory access patterns, operations and data types, and communication patterns. This repository also contains baseline CPU and GPU implementations of the PrIM benchmarks for comparison purposes.
PrIM also includes a set of microbenchmarks that can be used to assess various architectural limits, such as compute throughput and memory bandwidth.
- Triangle Counting (TC) - Order invariant with possible relabelling
- CPU - GAP
- GPU - TRICORE
- PIM - UPMEM
Please cite the following papers if you find this repository useful.
- Scott Beamer, Krste Asanović, David Patterson. "The GAP Benchmark Suite". arXiv:1508.03619 [cs.DC], 2015.
- Hu, Yang, Hang Liu, and H. Howie Huang. "TriCore: Parallel Triangle Counting on GPUs". SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018.
- Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu, "Benchmarking Memory-centric Computing Systems: Analysis of Real Processing-in-Memory Hardware". 2021 12th International Green and Sustainable Computing Conference (IGSC). IEEE, 2021.