
AXI Slaves and Masters for XRT

A SUS library that provides AXI slaves and masters for Xilinx's XRT (Xilinx Runtime).

Basic components provided:

  • AXI control slave: Provides input & output registers. Output registers are only usable in XRT user-managed kernels (see the host-side sketch after this list)
  • AXI Memory master: Simple low-bandwidth DDR read & write
  • AXI Memory master: High-bandwidth bursting AXI reader
  • AXI Memory master: (Unfinished) High-bandwidth bursting AXI writer
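
For orientation, here is a minimal host-side sketch (our own, not part of this repository) of how such a control slave is typically driven through XRT's native C++ API. User-managed kernels are opened as xrt::ip rather than xrt::kernel; the xclbin path, kernel name, and register offsets below are placeholders.

```cpp
// Hypothetical host-side sketch: accessing a user-managed kernel's AXI
// control slave registers via XRT's native C++ API. The xclbin path,
// kernel name, and register offsets are illustrative placeholders.
#include <xrt/xrt_device.h>
#include <experimental/xrt_ip.h>
#include <cstdint>

int main() {
    xrt::device device{0};
    auto uuid = device.load_xclbin("kernel.xclbin");

    // User-managed kernels are opened as xrt::ip, not xrt::kernel,
    // which exposes raw register access to the host.
    xrt::ip ip{device, uuid, "my_kernel"};

    // Write an input register of the AXI control slave.
    ip.write_register(0x10, 42);

    // Read back an output register (only available for user-managed kernels).
    uint32_t result = ip.read_register(0x18);
    return result == 42 ? 0 : 1;
}
```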

Lessons Learned - VCK5000

Extrapolated from various benchmarks; more details in MIXED.md, 24x512.md, and 20x256.md.

Bandwidths

Single-Reader Measurements at 355.2 MHz

AXI Width   Bandwidth    % useful cycles
32          1.41 GB/s    100%
64          2.82 GB/s    100%
128         5.56 GB/s    98%
256         10.50 GB/s   93%
512         13.55 GB/s   60%
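
As a sanity check on the "% useful cycles" column: the theoretical peak is (width / 8) bytes per cycle times the clock. At 355.2 MHz a 512-bit reader could move 64 B × 355.2 MHz ≈ 22.7 GB/s, and 13.55 / 22.7 ≈ 60%; likewise a 32-bit reader peaks at 4 B × 355.2 MHz ≈ 1.42 GB/s, matching the 100% row within rounding.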

The startup latency (i.e. the latency between a read request being issued and the first data element arriving) appears to be 70 cycles for 32-bit-wide AXI, but 50 cycles for 64-bit. Larger widths weren't measured.
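
To put those numbers in perspective: at one data beat per cycle, roughly 70 beats must already be in flight on a 32-bit interface just to cover the startup latency, so in-flight limits near or below that would cap bandwidth well under the figures above. The MAX_IN_FLIGHT recommendations further down sit comfortably above it.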

Multi-reader bandwidth

(Combined from 512-bit @ 320 MHz and 256-bit @ 348 MHz benchmarks)

#parallel   512-bit BW (GB/s)   256-bit BW (GB/s)
1           13.5623             10.4925
2           25.1748             20.9844
3           22.0688             27.6769
4           29.2881             33.6021
5           25.6728             31.8993
6           17.7485             37.9892
7           27.5624             44.2615
8           31.4736             48.7837
9           35.4239             44.8638
10          39.2506             49.1469
11          42.9778             29.0227
12          46.4071             46.3362
13          47.063              48.4111
14          50.8537             50.2339
15          54.0014             53.9341
16          47.826              53.3576
17          44.5466             56.3916
18          46.6465             57.1011
19          47.7573             54.6633
20          49.944              56.767
21          52.1049             N/A
22          51.6131             N/A
23          52.3697             N/A
24          47.9105             N/A

(We simply start from kernel #1 and add kernels sequentially.)

Total bandwidth does tend to increase with more readers, but certain conflicts cut the bandwidth significantly (such as 256-bit with 11 parallel readers).

Peak bandwidth ever measured: ~57 GB/s.

Conflicts

It appears that NoC interfaces on the same vertical NoC (VNoC) conflict. Worse, NoC interfaces on the same VNoC sometimes conflict so badly that total bandwidth is less than if a single interface were communicating.

Conflicting NoCs

Number of hard-logic NoC connection points: 23

If more memory masters than this are instantiated, programmable-logic "virtual" NoC switches are inserted. Single-interface bandwidth appears to be maintained, but multi-interface bandwidth on the same virtual NoC suffers tremendously. Recommendation: don't exceed 23 interfaces.

Figure: Programmable Logic NoC. Observe that the NoC endpoint isn't directly connected to the two pink kernels (Kernel 1 and Kernel 24, the worst pairing in the conflicts benchmark). Instead, the large blob of orange logic is the virtual extension to the NoC, to which both kernels connect.

Optimal MAX_IN_FLIGHT values on VCK5000

Benchmarking the 256-bit case suggests that, for optimal bandwidth, a MAX_IN_FLIGHT of 64 elements above the burst size is sufficient. For 512-bit a slightly lower bound is enough, which may allow smaller FIFOs. A lookup sketch follows the table below.

AXI_WIDTH   MAX_IN_FLIGHT
32          320
64          320
128         320
256         192
512         110
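
For host code that wants to parameterize against this table, a trivial lookup works. This sketch is ours, and the burst-length rationale in the comments is inferred from the "64 elements over the burst size" rule above (bursts presumably being capped by the AXI 4 KB boundary and 256-beat limit).

```cpp
// Hypothetical helper mirroring the MAX_IN_FLIGHT table above; the
// function name is ours, the values come straight from the table.
#include <cstdint>
#include <stdexcept>

constexpr uint32_t optimal_max_in_flight(uint32_t axi_width_bits) {
    switch (axi_width_bits) {
        case 32:                // presumably 256-beat bursts + 64 slack
        case 64:
        case 128: return 320;
        case 256: return 192;   // presumably 128-beat (4 KB) bursts + 64 slack
        case 512: return 110;   // below 64-beat bursts + 64; a "slightly
                                // lower bound is good enough" here
        default:  throw std::invalid_argument("unsupported AXI width");
    }
}
```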

Misc

  • Around 460 MHz, 256-bit AXI readers attain bandwidth identical to 512-bit readers.
  • The ARCACHE[1] bit does not seem to have an effect.
  • The VCK5000 does not appear to have NUMA-like memory regions. While kernel.group_id(0) returns different values, buffers created with these show no appreciable difference in access bandwidth (see the allocation sketch after this list).
  • There is only one memory bank.
  • Host DMA is not supported.
  • Rarely, XRT has a 'blip' that adds a 500 ms delay after a set of kernels finishes.
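
For context on the group_id remark, this is the usual XRT native C++ pattern for placing a buffer in the memory group a kernel argument connects to; the xclbin path, kernel name, and buffer size are placeholders.

```cpp
// Standard XRT buffer allocation against a kernel argument's memory
// group. The xclbin path, kernel name, and buffer size are placeholders.
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>
#include <xrt/xrt_bo.h>

int main() {
    xrt::device device{0};
    auto uuid = device.load_xclbin("kernel.xclbin");
    xrt::kernel kernel{device, uuid, "my_kernel"};

    // group_id(0) reports the memory group of kernel argument 0.
    // Per the observation above, on the VCK5000 different groups show
    // no appreciable difference in access bandwidth.
    xrt::bo buffer{device, 1024 * 1024, kernel.group_id(0)};

    buffer.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    return 0;
}
```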
