Changes from 40 commits
57 commits
eaa7433
[Shuffle] Initial migration of brainsmith shuffle code over to FINN b…
STFleming Aug 26, 2025
33e8b76
[Shuffle] Tests are progressing further, stuck at cppsim compliation …
STFleming Aug 26, 2025
6b47c4f
[Shuffle] First migrated cpptests from BS are passing.@
STFleming Aug 28, 2025
38c9ea7
[Shuffle] Starting work on getting RTLSim tests to pass, changes need…
STFleming Aug 28, 2025
65e6a1b
[Shuffle] removing custom execute node as it was incorrect and not re…
STFleming Sep 1, 2025
ab81257
[Shuffle] Couldn't fall back on default execute_node due to shape iss…
STFleming Sep 1, 2025
90dca4a
[Shuffle] Added the first cut at the RTL CustomHWOp for the PTranspose
STFleming Sep 1, 2025
1a5e4bf
[Shuffle] Added a simple inference step that determines if we should …
STFleming Sep 1, 2025
acfb4b4
[Shuffle/PTranspose] Working through simple test case for PTranspose …
STFleming Sep 1, 2025
99d9a94
[Shuffle/PTranspose] Added missing template $
STFleming Sep 1, 2025
3a58d1c
[Shuffle/PTranspose] First PTranspose based tests are passing
STFleming Sep 1, 2025
7ca0f4e
[Shuffle/PTranspose] Adding a transformation that decomposes transpos…
STFleming Sep 2, 2025
88c99fb
[Shuffle] Not entirely sure what the correct one is here so I'm inclu…
STFleming Sep 2, 2025
cc1a150
[Shuffle/Ptranspose] Addressing some of the issues and bugs during in…
STFleming Sep 3, 2025
1bf4c76
[Shuffle/PTranspose] Fixing issue with ptranspose that was causing de…
STFleming Sep 3, 2025
f1b336d
[Shuffle/Ptranspose] Better transpose decomposition method based on t…
STFleming Sep 3, 2025
ed0b302
[Shuffle/Ptranspose] Added a bunch more tests and tidied up the test …
STFleming Sep 4, 2025
f6ed9d3
[Shuffle/PTranspose] Shuffle and PTranspose now attempt to adjust the…
STFleming Sep 4, 2025
ae55c51
[Shuffle/Ptranspose] Removing test that is not feasible for the given…
STFleming Sep 4, 2025
9499987
[Shuffle/Ptranspose] Cppsim tests are now functional and passing
STFleming Sep 4, 2025
5ebbcb3
[Shuffle/PTranspose] removing debug model saving from pytests
STFleming Sep 4, 2025
78200ed
Merge branch 'dev' into feature/shuffle
STFleming Sep 4, 2025
f212dba
[Shuffle/PTranspose] Changing the format of the output name to match …
STFleming Sep 4, 2025
e152ac0
[Shuffle/PTranspose] Added a stitchedIP generation test (WIP needs so…
STFleming Sep 4, 2025
2518338
[Shuffle/PTranspose] Temporary reduce the HLS clock to 100MHz to matc…
STFleming Sep 4, 2025
cd75e1f
[Shuffle/PTranspose] Fixing pre-commit issues and adding appropriate …
STFleming Sep 8, 2025
af56242
[Shuffle/PTranspose] Added more test flags and checks of environ vari…
STFleming Sep 8, 2025
41bf90c
[Shuffle/PTranspose] pre-commit fix
STFleming Sep 8, 2025
efee80c
[Shuffle/PTranspose] Cleaning up some unused operators
STFleming Sep 19, 2025
ed1b7de
[Shuffle] Renaming PTranspose CustomOp to InnerShuffle to increase cl…
STFleming Sep 19, 2025
31b914a
[Shuffle] Changing the previous shuffle to an OuterShuffle CustomHWOp
STFleming Sep 19, 2025
e2a3ae5
[Shuffle] Cleaning up issues with OuterShuffle renaming.
STFleming Sep 22, 2025
c0099b7
[Shuffle] Changing the order of when the transpose decomposition happ…
STFleming Sep 22, 2025
b329c79
[Shuffle] Cleanup
STFleming Sep 22, 2025
c45cfda
[Shuffle] removing transpose decomposition test, it made more sense w…
STFleming Sep 22, 2025
91b5a41
[Shuffle] fix how SIMD is now being applied for the new SIMD aware de…
STFleming Sep 22, 2025
9a6347e
[Shuffle] Add the transpose transformations to the default builder steps
STFleming Sep 22, 2025
a8fb5ac
[Shuffle] Changing the name of the pT and iG operators in the decompo…
STFleming Sep 22, 2025
2c5d955
[Shuffle] Reducing the number of stitched ip tests to keep down testi…
STFleming Sep 23, 2025
cfb6bb2
[Shuffle] Adding missing interface parameter.
STFleming Sep 23, 2025
64e0ae6
[OuterShuffle] Prune hls code gen functions to use parent methods ins…
auphelia Sep 24, 2025
9ef22e2
[OuterShuffle] Removing temp utils and input_gen.hpp code and updatin…
STFleming Sep 25, 2025
b88fd4f
[Shuffle] Changing the naming convention of the in_reshape and out_re…
STFleming Sep 25, 2025
52e2430
[Shuffle] Fixing issue with the first reshape being lost during shuff…
STFleming Sep 26, 2025
844357c
[Shuffle] Fixing issue with first node reshape during decomposition; …
STFleming Sep 26, 2025
6094d99
[Shuffle] Added comment as to why Shuffles are skipped during the spe…
STFleming Sep 26, 2025
ae9647f
[Shuffle] cleanup iterator
STFleming Sep 26, 2025
a898752
[Shuffle] Removing unecessary node_ind decrement
STFleming Sep 26, 2025
d7d20c9
Merge branch 'dev' into feature/shuffle
auphelia Sep 29, 2025
cfe916a
[Tests] Change target fpga for shuffle test to be able to test with 2…
auphelia Sep 29, 2025
6b73902
Run pre-commit
auphelia Sep 29, 2025
31f2658
[Shuffle] When exporting the final_hw_config export the initial SIMD …
STFleming Oct 3, 2025
526f767
[Shuffle] Remove double defined function and throw assertions instead…
auphelia Oct 9, 2025
41be57d
[Shuffle] Refactor code
auphelia Oct 9, 2025
bb5d446
[InnerShuffle] Making RTL match what is expected in SV style guide
STFleming Oct 13, 2025
c371626
[Shuffle] Special casing shuffle in the rtlsim io_dict preparation
STFleming Oct 15, 2025
caab2cf
Merge branch 'dev' into feature/shuffle
auphelia Oct 17, 2025
134 changes: 134 additions & 0 deletions custom_hls/bs_utils.hpp
@@ -0,0 +1,134 @@
/****************************************************************************
* Copyright (C) 2025, Advanced Micro Devices, Inc.
* All rights reserved.
*
* SPDX-License-Identifier: MIT
*
* @author Shane T. Fleming <[email protected]>
****************************************************************************/

#ifndef SM_UTIL_HPP
#define SM_UTIL_HPP
#include <hls_stream.h>
#include "hls_vector.h"

//- Compile-Time Functions --------------------------------------------------

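// ceil(log2(x))
// NOTE: left disabled here. input_gen.hpp calls clog2(), so a definition must
// already be visible from another header when that file is included.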
//template<typename T>
//constexpr unsigned clog2(T x) {
// return x<2? 0 : 1+clog2((x+1)/2);
//}

//- Streaming Flit with `last` Marking --------------------------------------
template<typename T>
struct flit_t {
bool last;
T data;

public:
flit_t(bool last_, T const &data_) : last(last_), data(data_) {}
~flit_t() {}
};

//- Streaming Copy ----------------------------------------------------------
template<typename T>
void move(hls::stream<T> &src, hls::stream<T> &dst) {
#pragma HLS pipeline II=1 style=flp
if(!src.empty()) dst.write(src.read());
}

//- Tree Reduce -------------------------------------------------------------
template< unsigned long N, typename TA, typename TR = TA, typename F >
TR tree_reduce(hls::stream<TA> &v, F f) {
#pragma HLS inline
#pragma HLS function_instantiate variable=f
TR tree[2*N-1];
#pragma HLS array_partition complete dim=1 variable=tree
for(unsigned i = N; i-- > 0;) {
#pragma HLS unroll
tree[N-1 + i] = v.read();
}
for(unsigned i = N-1; i-- > 0;) {
#pragma HLS unroll
tree[i] = f(tree[2*i+1], tree[2*i+2]);
}
return tree[0];
}

// Recursive pairwise max reduction.
// Each stage keeps the larger element of every adjacent pair, building a
// comparison tree that yields the maximum of the input vector.
template<unsigned N, typename T>
struct MaxReduction {

static T max(const hls::vector<T, N>& input) {
#pragma HLS INLINE
constexpr unsigned M = (N + 1) / 2;
hls::vector<T, M> res;

for(unsigned i = 0; i < M; ++i) {
#pragma HLS unroll
if (2*i + 1 < N)
res[i] = input[2*i] > input[2*i + 1] ? input[2*i] : input[2*i + 1];
else
res[i] = input[2*i]; // Handle the case where the input size is odd
}

return MaxReduction<M, T>::max(res);
}

};

template<typename T>
struct MaxReduction<2, T> {
static T max(const hls::vector<T, 2>& input) {
#pragma HLS INLINE
return (input[0] > input[1]) ? input[0] : input[1];
}
};

template<typename T>
struct MaxReduction<1, T> {
static T max(const hls::vector<T, 1>& input) {
#pragma HLS INLINE
return input[0];
}
};

// Recursive pairwise reduction tree computing the sum of all elements.
// General stage for N > 2 inputs; N = 2 and N = 1 are specialized below.
template<typename T, unsigned N>
struct TreeReduction {
static T reduce(const hls::vector<T, N>& input) {
#pragma HLS INLINE
constexpr unsigned M = (N + 1) / 2;
hls::vector<T, M> sum;

for(unsigned i = 0; i < M; ++i) {
#pragma HLS unroll
if (2*i + 1 < N)
sum[i] = input[2*i] + input[2*i + 1];
else
sum[i] = input[2*i]; // Handle the case where the input size is odd
}

return TreeReduction<T, M>::reduce(sum);
}
};

template<typename T>
struct TreeReduction<T, 2> {
static T reduce(const hls::vector<T, 2>& input) {
#pragma HLS INLINE
return input[0] + input[1];
}
};

template<typename T>
struct TreeReduction<T, 1> {
static T reduce(const hls::vector<T, 1>& input) {
#pragma HLS INLINE
return input[0];
}
};

#endif
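MaxReduction and TreeReduction above share one pattern: each stage combines adjacent pairs, carries an unpaired last element through unchanged, and recurses on the half-length result. Below is a minimal host-side sketch of that pattern, assuming plain std::array in place of hls::vector and no HLS pragmas. PairwiseTree, add and larger are hypothetical names used only for illustration and are not part of this PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of the pairwise reduction pattern.
// General stage: N inputs are reduced to M = ceil(N/2) intermediate values.
template<typename T, std::size_t N>
struct PairwiseTree {
    template<typename F>
    static T reduce(const std::array<T, N> &in, F f) {
        constexpr std::size_t M = (N + 1) / 2;
        std::array<T, M> next{};
        for (std::size_t i = 0; i < M; ++i) {
            // Combine a full pair, or pass the unpaired last element through.
            next[i] = (2 * i + 1 < N) ? f(in[2 * i], in[2 * i + 1]) : in[2 * i];
        }
        return PairwiseTree<T, M>::reduce(next, f);
    }
};

// Terminal stage: a single value is the result.
template<typename T>
struct PairwiseTree<T, 1> {
    template<typename F>
    static T reduce(const std::array<T, 1> &in, F) {
        return in[0];
    }
};

// Illustrative combiners (assumed names, not from the PR).
inline int add(int a, int b) { return a + b; }
inline int larger(int a, int b) { return a > b ? a : b; }
```

For five inputs the stages hold 5, 3, 2 and 1 values, so `PairwiseTree<int, 5>::reduce(v, add)` with `v = {1, 2, 3, 4, 5}` yields 15, and the same call with `larger` on `{3, 9, 2, 7, 5}` yields 9. In this sketch the N = 2 case falls out of the general stage, so only the N = 1 terminal needs a specialization.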
237 changes: 237 additions & 0 deletions custom_hls/input_gen.hpp
@@ -0,0 +1,237 @@
/****************************************************************************
* Copyright (C) 2025, Advanced Micro Devices, Inc.
* All rights reserved.
*
* SPDX-License-Identifier: MIT
*
* @author Thomas B. Preußer <[email protected]>
****************************************************************************/

#ifndef INPUT_GEN_HPP
#define INPUT_GEN_HPP

#include <ap_int.h>
#include <hls_stream.h>
#include "bs_utils.hpp"

#include <algorithm>
#include <tuple>
#include <type_traits>

/**
* Computes the updates of the read and free pointers for a buffer read out by
* the specified loop nest.
*
* @param R also responsible for the update of the free pointer
* @param V loop nest specification, odd length, see specializations
*
* A given perfect loop nest:
*
* for(unsigned i0 = 0; i0 < N0; i0++) {
* for(unsigned i1 = 0; i1 < N1; i1++) {
* ...
* for(unsigned in = 0; in < Nn; in++) {
* emit(ifm[C0*i0 + C1*i1 + ... + Cn*in]);
* }
* ...
* }
* }
*
* encodes as:
*
* Nest<true, IFM_SIZE, N0, C0, N1, C1, ..., Nn, Cn>
*
* As this class computes relative updates by each invocation of `tick()`,
* an absolute offset must be reflected in the original pointer initialization.
* The contract for a directly enclosed loop is:
* - Over one entire period of the loop, the cumulative read pointer updates
* amount to the value immediately preceding its iteration count in V.
* - The free pointer is incremented in lockstep if R is true and if this loop's
* own increments are positive and would fit entirely into a period of the
* enclosing loop.
* Currently, all coefficients Ci must be positive. The implication is that
* every completed loop induces a net non-negative read-pointer increment.
* Negative read pointer updates are only possible by loop termination leaving
* a net positive update for the enclosing loop but possibly retracting the read
* pointer back to the expected enclosing increment after overshooting
* internally.
* As each completed loop guarantees a net positive increment, negative pointer
* retractions never add up. Thus, the biggest retraction can be used to
* dimension provided buffer storage.
*/
template<bool R, unsigned... V>
class Nest {};

/**
* Terminal innermost loop.
*
* @param R also responsible for the update of the free pointer
* @param W represented increment of read pointer
*/
template<
bool R,
unsigned W
>
class Nest<R, W> {
public:
static constexpr unsigned rp_rewind = 0;
static constexpr unsigned fp_rewind = 0;

static constexpr int max_rp_retract = 0;

public:
std::tuple<int, unsigned, ap_int<1>> tick() {
#pragma HLS inline
return { W, R? W : 0, -1 };
}
};

/**
* Non-terminal loop.
*
* @param R also responsible for the update of the free pointer
* @param W represented increment of read pointer
* @param N iteration count of directly enclosed loop
* @param C increment of read pointer by directly enclosed loop
* @param V further nested loops
*
* - Each non-terminal loop will slice off two values, W & N, from the
* specification vector V.
* - The directly enclosed loop will inherit responsibility for the
* free pointer update only if it represents a strictly monotonic increase
* contained entirely within the pointer update of this loop.
*/
template<
bool R,
unsigned W,
unsigned N,
unsigned C,
unsigned... V
>
class Nest<R, W, N, C, V...> {

static constexpr bool R_INNER = R && (0 < C) && (C*N <= W);
using Inner = Nest<R_INNER, C, V...>;

public:
static constexpr unsigned rp_rewind = (N-1)*C + Inner::rp_rewind;
static constexpr unsigned fp_rewind = R_INNER? (N-1)*C + Inner::fp_rewind : 0;

private:
static constexpr int terminal_rp_inc = W - rp_rewind;
public:
static constexpr int max_rp_retract = std::max(-terminal_rp_inc, Inner::max_rp_retract);

private:
static_assert(N > 0, "Must have positive iteration count.");
ap_int<1+clog2(std::max(1u, N-1))> cnt = N-2; // N-2, N-1, ..., 1, 0, -1
Inner inner;

public:
std::tuple<int, unsigned, ap_int<2+sizeof...(V)/2>> tick() {
#pragma HLS inline
auto const t = inner.tick();
int rp_inc = std::get<0>(t);
unsigned fp_inc = std::get<1>(t);
ap_int<2+sizeof...(V)/2> term = std::get<2>(t);

if(term < 0) {
if(cnt < 0) {
rp_inc = terminal_rp_inc;
if(R) fp_inc = W - fp_rewind;
cnt = N-2;
}
else {
term[decltype(term)::width-1] = 0;
cnt--;
}
}
return { rp_inc, fp_inc, term };
}
};

/**
* Input generator:
* - over a feature map of pixels of type T
* - iterated over by the loop nest specified by V
* - optionally identifying the completion of a kernel produced by the M innermost loops.
*
* @param M innermost loop count constituting a kernel
* M < 0 - no `last` indicator on destination stream
* M >= 0 - `last` indicator on destination stream:
* 0 - always asserted
* 1 - upon completion of innermost loop
* M - upon completion of M innermost loops
* @param V loop nest descriptor, see above for Nest<>
* @param T (inferred) pixel type
*/
template<int M, unsigned... V, typename T>
void input_gen(
hls::stream<T> &src,
hls::stream<typename std::conditional<M < 0, T, flit_t<T>>::type> &dst
) {
#pragma HLS pipeline II=1 style=flp

// Write Pointer update delay needed to accommodate memory read-out latency.
constexpr unsigned WP_DELAY = 4;

using MyNest = Nest<true, V...>;
constexpr unsigned ADDR_BITS = clog2(2*MyNest::max_rp_retract + WP_DELAY);
constexpr unsigned BUF_SIZE = 1 << ADDR_BITS;
using ptr_t = ap_int<1 + ADDR_BITS>;

static MyNest nest;
static T buf[BUF_SIZE];
static ptr_t wp[WP_DELAY] = { 0, };
static ptr_t rp = 0;
static ptr_t fp = 0;
#pragma HLS reset variable=nest
#pragma HLS reset variable=buf off
#pragma HLS reset variable=wp
#pragma HLS reset variable=rp
#pragma HLS reset variable=fp
#pragma HLS dependence variable=buf inter false
#pragma HLS dependence variable=buf intra false
#pragma HLS array_partition variable=wp complete

static bool ovld = false;
static struct OBuf {
bool lst;
T dat;

public:
operator T const&() const { return dat; }
operator flit_t<T>() const { return { lst, dat }; }
} obuf;
#pragma HLS reset variable=ovld
#pragma HLS reset variable=obuf off

// Update delay pipeline for wp
for(unsigned i = WP_DELAY-1; i > 0; i--) wp[i] = wp[i-1];

// Read into buffer memory if capacity is available
if(/* wp <= fp' */ ptr_t(wp[0]-fp) >= 0) {
T x;
if(src.read_nb(x)) buf[ap_uint<ADDR_BITS>(wp[0]++)] = x;
}

// Try to clear output buffer
if(ovld) ovld = !dst.write_nb(obuf);

// Try to refill output buffer
if(!ovld) {
obuf.dat = buf[ap_uint<ADDR_BITS>(rp)];

if(/* rp < wp */ ptr_t(rp-wp[WP_DELAY-1]) < 0) {
auto const t = nest.tick();
rp += std::get<0>(t);
fp += std::get<1>(t);

if(M >= 0) obuf.lst = std::get<2>(t)[M];
ovld = true;
}
}

} // input_gen()

#endif
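The Nest documentation above describes relative read pointer updates: a stride while a loop is still running, and a retraction plus the enclosing increment when it completes. Below is a minimal host-side sketch of that arithmetic, assuming plain integers instead of ap_int and explicit iteration indices instead of the sign-bit down-counters. It models only the read pointer, not the free pointer or the `last` flags. NestModel, Level and trace are hypothetical names and are not part of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical software model, not the HLS class from this PR.
struct Level {
    unsigned n;  // iteration count Ni
    unsigned c;  // read pointer stride Ci
};

class NestModel {
    unsigned w_;                 // net advance over one full traversal (e.g. IFM_SIZE)
    std::vector<Level> levels_;  // outermost loop first
    std::vector<unsigned> idx_;  // current iteration index of each loop

public:
    NestModel(unsigned w, std::vector<Level> levels)
        : w_(w), levels_(std::move(levels)), idx_(levels_.size(), 0) {}

    // Relative read pointer update that moves from the current element to the next.
    int tick() {
        int inc = 0;
        for (std::size_t k = levels_.size(); k-- > 0;) {
            if (idx_[k] + 1 < levels_[k].n) {
                ++idx_[k];
                return inc + static_cast<int>(levels_[k].c);
            }
            // Loop k completes: retract by what its earlier iterations advanced.
            inc -= static_cast<int>((levels_[k].n - 1) * levels_[k].c);
            idx_[k] = 0;
        }
        // Every loop completed: apply the net advance of the whole nest.
        return inc + static_cast<int>(w_);
    }
};

// Addresses read during `count` ticks starting from read pointer 0,
// followed by the final pointer value after the last tick.
std::vector<int> trace(NestModel &nest, unsigned count) {
    std::vector<int> addrs;
    int rp = 0;
    for (unsigned i = 0; i < count; ++i) {
        addrs.push_back(rp);
        rp += nest.tick();
    }
    addrs.push_back(rp);
    return addrs;
}
```

As a worked case, reading a 2x3 row-major block column by column corresponds to the specification `<6, 3, 1, 2, 3>` (W = 6, N0 = 3, C0 = 1, N1 = 2, C1 = 3). With `NestModel m(6, {{3, 1}, {2, 3}})`, `trace(m, 6)` visits addresses 0, 3, 1, 4, 2, 5 and ends with the pointer at 6, the start of the next block. The increments are +3, -2, +3, -2, +3, +1, so the largest retraction in this case is 2.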