Replies: 1 comment
Hi, you are on the right track. If you need it quickly, implementing a new HLS custom op is probably the best option. My colleague Christoph (@iksnagreb) also plans to implement something like this in a more generic way; maybe he can chime in.
Hi FINN developers and community,
I'm working on accelerating a Temporal Convolutional Network (TCN) model using FINN for deployment on a Pynq-Z1. The TCN architecture involves multiple blocks, and due to the nature of 'valid' convolutions, the temporal dimension shrinks through the network.
Goal:
After the final TCN block, I have an activation tensor (e.g., shape [N, C, 168, 1]). To reduce downstream computation (specifically, before the equivalent of the final fully connected layer), I only need a specific segment of this tensor along the temporal dimension (e.g., the central element, resulting in [N, C, 1, 1]). My goal is to implement this selection within the FPGA hardware dataflow generated by FINN, for maximum efficiency.
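For concreteness, the selection itself is trivial in PyTorch (a minimal sketch; treating index 84 as the "central" element of a length-168 axis is my assumption):

```python
import torch

# Activation after the final TCN block: [N, C, T, 1] with T = 168.
x = torch.randn(1, 16, 168, 1)

# Keep a single temporal position; slicing 84:85 keeps the dimension
# instead of squeezing it, so the result has shape [N, C, 1, 1].
center = 84  # assumed "central" index
x_sel = x[:, :, center:center + 1, :]
print(x_sel.shape)  # torch.Size([1, 16, 1, 1])
```

The hard part is getting an equivalent operation into the FINN hardware dataflow.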
Tracing this backwards, introducing slice nodes that keep only the indices relevant to the single final output of the network could save a lot of computation:
Input:
Input size [1,1000] --> Input size [1,665]
LayerName [input_tensor_size] --> [potentially_sliced_tensor_size]
Block1:
Conv1 [1,1,1000,1] --> [1,1,665,1]
BatchNorm1 [1,4,496,1] --> [1,4,329,1]
Conv2 [1,4,496,1] --> [1,4,329,1]
SLICE here could reduce 329 to 81
BatchNorm2 [1,4,488,1] --> [1,4,81,1]
ReLU [1,4,488,1] --> [1,4,81,1]
Block2:
Conv1 [1,4,488,1] --> [1,4,81,1]
BatchNorm1 [1,8,456,1] --> [1,8,73,1]
Conv2 [1,8,456,1] --> [1,8,73,1]
SLICE here could reduce 73 to 17
BatchNorm2 [1,8,424,1] --> [1,8,17,1]
ReLU [1,8,424,1] --> [1,8,17,1]
Block3:
Conv1 [1,8,424,1] --> [1,8,17,1]
BatchNorm1 [1,16,296,1] --> [1,16,9,1]
Conv2 [1,16,296,1] --> [1,16,9,1]
SLICE here could reduce 9 to 1
BatchNorm2 [1,16,168,1] --> [1,16,1,1]
ReLU [1,16,168,1] --> [1,16,1,1]
For reference, I already know which indices to keep in each layer for each tensor.
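For anyone curious, these index sets can be derived by walking backwards through the stride-1 'valid' convolutions; a minimal sketch (the kernel sizes and dilations below are placeholders, not my actual layer parameters):

```python
def required_input_indices(output_indices, kernel_size, dilation=1):
    # For a stride-1 'valid' convolution, output index o consumes
    # input indices o, o + d, ..., o + (kernel_size - 1) * d.
    needed = set()
    for o in output_indices:
        for j in range(kernel_size):
            needed.add(o + j * dilation)
    return sorted(needed)

# Trace the single final output (index 0) back through two layers;
# kernel_size/dilation values here are illustrative only.
idx = [0]
idx = required_input_indices(idx, kernel_size=9, dilation=1)
idx = required_input_indices(idx, kernel_size=9, dilation=8)
print(len(idx), idx[:5])
```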
The only thing missing is support for a Slice operation in FINN's hardware dataflow, so I have spent a few days trying different workarounds.
Attempts and Failures:
Normal slicing in PyTorch between layers
Pruning, by modifying the PruneChannels transformation from fastmachinelearning/qonnx into a PruneSamples variant
Potential paths & Questions:
Based on discussions and my understanding of FINN, I only see one potential path forward:
Make a Custom HLS Streaming Slice or Gather Layer:
Method: Develop a new custom HLS layer for streaming slice operations on activations. This would involve writing HLS code for an AXI-Stream component that selectively passes through input data based on indices or a pattern, defining a new FINN custom op (StreamingSlice), and creating a FINN transformation pass to replace the ONNX Slice/Gather with this custom op (a rough sketch of such a pass follows after this list).
Pros: Could be highly efficient at runtime (cycle-accurate selection, only processing desired data). Explicit control.
Cons/Question: Significant development effort required (HLS, AXI-Stream, FINN internals). Is this overkill? Are there existing examples or utilities within FINN that might simplify this process if deemed necessary?
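For reference, on the transformation side I imagine something roughly like the sketch below, assuming the qonnx Transformation base class. StreamingSlice is hypothetical (it does not exist in FINN), and the attribute names are placeholders:

```python
from onnx import helper
from qonnx.transformation.base import Transformation

class InferStreamingSlice(Transformation):
    """Sketch: replace ONNX Slice nodes with a hypothetical
    StreamingSlice custom op, so it can be mapped to an HLS kernel."""

    def apply(self, model):
        graph_modified = False
        for node_ind, node in enumerate(list(model.graph.node)):
            if node.op_type != "Slice":
                continue
            # In opset >= 10, Slice takes starts/ends/axes as inputs;
            # a real pass would read them via model.get_initializer().
            new_node = helper.make_node(
                "StreamingSlice",              # hypothetical op_type
                [node.input[0]],
                [node.output[0]],
                domain="finn.custom_op.fpgadataflow",
                slice_start=84,                # placeholder attributes
                slice_end=85,
            )
            # Insert at the same position to keep topological order.
            model.graph.node.insert(node_ind, new_node)
            model.graph.node.remove(node)
            graph_modified = True
        return (model, graph_modified)
```

The HLS kernel itself would then presumably just be a pipelined loop that reads the whole input stream and forwards only the selected positions, so no buffering of the full tensor should be needed.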
Given the above, my questions are:
What is the recommended approach within the FINN ecosystem for implementing this kind of activation tensor selection/slicing in hardware?
Are there any other FINN transformations or techniques I might have missed that could handle this?
Are there specific pitfalls or pointers the community could share regarding custom HLS layer development for this type of streaming data manipulation?
Thanks in advance for any insights or guidance!