Project to benchmark DPU-accelerated Breadth-First Search. DPUs are tiny in-memory ARM processors. 1 DPU is built into the RAM for every 64MBs, for a total of 2048 DPUs on a 128GB server. DPUs are used to accelerate specific memory-bound workloads up to x100. This project uses BFS as a demonstration of this capability.
Note: Tested with UPMEM DPU SDK 2020.3.x.
$ make
Optional environment variables for make:
BENCHMARK_TIME=truebenchmarks the BFS duration in seconds.BENCHMARK_CYCLES=truecounts the number of DPU cycles per BFS iteration.NR_TASKLETS=<integer>sets the number of tasklets per DPU (max 24, recommended 11).BLOCK_SIZE=<multiple_of_8>sets the MRAM DMA block size (multiple of 8, max 512 bytes).
$ ./bin/bfs -n <num_dpu> -a <base_algorithm> -p <partitioning> -o <output_result_path> <datafile>
Notes:
num_dpumust be a multiple of 8. Emulator has a limit of 64.datafileCOO-formated graph (adjacency list) that is tab separated, and sorted by the first column then the second column. The first line contains the number of nodes followed by the number of edges. See example below.base_algorithmis the base BFS algorithm to use, with options:topfor vertex-centric top-down BFS.botfor vertex-centric bottom-up BFS.edgefor edge-centric BFS.
partitioningthe way the adjacency matrix is partitioned over the DPUs, with options:rowpartition the source nodes (i.e. nodes).colpartition the destination nodes (i.e. neighbors).2dpartition both source nodes and destination nodes in tiles.
Example datafile:
<NUM_NODES> <NUM_EDGES>
0 1
0 2
0 11
1 3
1 4
...
bfs-dpu/
host/ # Host code (CPU side)
dpu/ # Task code (DPU side). Contains optimized DMA versions and non-optimized reader-friendly versions.