# SUS library providing AXI slaves and masters for Xilinx XRT
Basics that are provided:
- AXI control slave: Provides input & output registers. Output registers are only usable in XRT User-Managed Kernels
- AXI Memory master: Simple low-bandwidth DDR read & write
- AXI Memory master: High-bandwidth bursting AXI reader
- AXI Memory master: (Unfinished) High-bandwidth bursting AXI writer
## Benchmarks

Extrapolated from various benchmarks; more details in MIXED.md, 24x512.md, and 20x256.md.

### Single-Reader Measurements at 355.2 MHz
| AXI Width | Bandwidth | % useful cycles |
|---|---|---|
| 32 | 1.41GB/s | 100% |
| 64 | 2.82GB/s | 100% |
| 128 | 5.56GB/s | 98% |
| 256 | 10.50GB/s | 93% |
| 512 | 13.55GB/s | 60% |
The startup latency (the time between a read request being issued and the first data element arriving) appears to be 70 cycles for 32-bit-wide AXI, but 50 cycles for 64-bit-wide AXI. Larger widths weren't measured.
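The "% useful cycles" column can be sanity-checked against the raw numbers: peak bandwidth at one beat per cycle is the bus width in bytes times the clock frequency, and the useful-cycle fraction is the measured bandwidth divided by that peak. A minimal sketch (measured values copied from the table above; small rounding differences against the table are expected):

```python
# Useful-cycle fraction: what share of clock cycles the AXI data bus
# actually carries a beat, given the measured bandwidth.
F_CLK = 355.2e6  # Hz, clock of the single-reader benchmark

measured_gb_s = {32: 1.41, 64: 2.82, 128: 5.56, 256: 10.50, 512: 13.55}

def useful_cycles_pct(width_bits: int, measured: float) -> float:
    peak_gb_s = (width_bits / 8) * F_CLK / 1e9  # one beat per cycle
    return 100 * measured / peak_gb_s

for width, bw in measured_gb_s.items():
    peak = (width / 8) * F_CLK / 1e9
    print(f"{width:>3}-bit: peak {peak:6.2f} GB/s, "
          f"useful cycles ~{useful_cycles_pct(width, bw):.0f}%")
```

The 512-bit reader's ~60% figure falls directly out of this: 13.55 GB/s measured against a 22.73 GB/s peak.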
### Parallel-Reader Measurements

(Combined from the 512-bit @ 320 MHz and 256-bit @ 348 MHz benchmarks)
| #parallel | 512-bit BW (GB/s) | 256-bit BW (GB/s) |
|---|---|---|
| 1 | 13.5623 | 10.4925 |
| 2 | 25.1748 | 20.9844 |
| 3 | 22.0688 | 27.6769 |
| 4 | 29.2881 | 33.6021 |
| 5 | 25.6728 | 31.8993 |
| 6 | 17.7485 | 37.9892 |
| 7 | 27.5624 | 44.2615 |
| 8 | 31.4736 | 48.7837 |
| 9 | 35.4239 | 44.8638 |
| 10 | 39.2506 | 49.1469 |
| 11 | 42.9778 | 29.0227 |
| 12 | 46.4071 | 46.3362 |
| 13 | 47.063 | 48.4111 |
| 14 | 50.8537 | 50.2339 |
| 15 | 54.0014 | 53.9341 |
| 16 | 47.826 | 53.3576 |
| 17 | 44.5466 | 56.3916 |
| 18 | 46.6465 | 57.1011 |
| 19 | 47.7573 | 54.6633 |
| 20 | 49.944 | 56.767 |
| 21 | 52.1049 | N/A |
| 22 | 51.6131 | N/A |
| 23 | 52.3697 | N/A |
| 24 | 47.9105 | N/A |
(Kernels are added sequentially, starting from kernel #1.)

Total bandwidth does tend to increase with more readers, but some conflicts cut the bandwidth significantly (such as 256-bit with 11 parallel readers). Peak bandwidth ever measured: ~57 GB/s.
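The conflicts show up clearly as scaling efficiency: total bandwidth divided by N times the single-reader bandwidth. A small sketch over a few rows of the 256-bit column (values copied from the table above):

```python
# Scaling efficiency of N parallel 256-bit readers relative to
# perfectly linear scaling from the single-reader measurement.
SINGLE_256 = 10.4925  # GB/s, one 256-bit reader

total_256 = {1: 10.4925, 8: 48.7837, 11: 29.0227, 16: 53.3576, 20: 56.767}

def efficiency(n: int, total_gb_s: float) -> float:
    """Fraction of linear scaling achieved by n parallel readers."""
    return total_gb_s / (n * SINGLE_256)

for n, bw in total_256.items():
    print(f"{n:>2} readers: {bw:7.2f} GB/s total, "
          f"{100 * efficiency(n, bw):5.1f}% of linear scaling")
```

The 11-reader row drops to roughly a quarter of linear scaling, while its neighbors sit near half, which is the conflict behavior discussed below.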
It appears that NoC interfaces on the same vertical NoC (VNoC) conflict. Worse, NoC interfaces on the same VNoC sometimes conflict so badly that total bandwidth is less than if a single interface were communicating alone.
If more memory masters are instantiated than there are hard NoC interfaces, "virtual" NoC switches are built in programmable logic. Single-interface bandwidth appears to be maintained, but multi-interface bandwidth on the same virtual NoC suffers tremendously. Recommendation: don't exceed 23 interfaces.
Observe that the NoC endpoint isn't directly connected to the two pink kernels, Kernel 1 and Kernel 24, the worst pairing in the conflicts benchmark. Instead, the large blob of orange logic is the virtual extension of the NoC, to which both kernels connect.
### MAX_IN_FLIGHT FIFO Sizing

From benchmarking the 256-bit case, it appears that for optimal bandwidth a MAX_IN_FLIGHT of about 64 elements over the maximum burst size is sufficient. For 512-bit, a slightly lower bound suffices, which may allow smaller FIFOs.
| AXI_WIDTH | MAX_IN_FLIGHT |
|---|---|
| 32 | 320 |
| 64 | 320 |
| 128 | 320 |
| 256 | 192 |
| 512 | 110 |
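One plausible reading of this table, assuming the burst size is bounded by the standard AXI4 limits (at most 256 beats per burst, and no burst crossing a 4 KB address boundary), is MAX_IN_FLIGHT = maximum burst length in beats + the ~64-element margin mentioned above, with 512-bit using the lower measured bound of 110. This derivation is a hypothesis, not something stated by the benchmarks:

```python
# Hypothetical reconstruction of the MAX_IN_FLIGHT table, assuming
# bursts are limited by AXI4: max 256 beats per burst and no burst
# crossing a 4 KB boundary.
AXI4_MAX_BEATS = 256
AXI4_BOUNDARY_BYTES = 4096

def max_burst_beats(axi_width_bits: int) -> int:
    bytes_per_beat = axi_width_bits // 8
    return min(AXI4_MAX_BEATS, AXI4_BOUNDARY_BYTES // bytes_per_beat)

def max_in_flight(axi_width_bits: int) -> int:
    if axi_width_bits == 512:
        return 110  # measured: slightly below max_burst_beats + 64
    return max_burst_beats(axi_width_bits) + 64

for w in (32, 64, 128, 256, 512):
    print(f"{w:>3}-bit AXI: burst up to {max_burst_beats(w)} beats, "
          f"MAX_IN_FLIGHT = {max_in_flight(w)}")
```

Under this assumption the 32/64/128-bit rows all land on 320 because their burst length is capped at 256 beats, while 256-bit is boundary-limited to 128 beats, giving 192.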
## Misc Observations

- At around 460 MHz, 256-bit AXI readers attain identical bandwidth to 512-bit readers
- The ARCACHE[1] bit does not seem to have an effect
- The VCK5000 does not appear to have NUMA-like memory regions. While `kernel.group_id(0)` returns different values, buffers created with these have no appreciable difference in access bandwidth
- There is only one memory bank
- No host DMA is supported
- Rarely, XRT has a 'blip', which adds a 500 ms delay after a set of kernels finishes
