SAMO: Automatic SIMD and PE optimization for FINN #693
AlexMontgomerie started this conversation in Show and tell
SAMO: Streaming-Architecture to FPGA Mapping Optimiser
When designing a DNN accelerator for an FPGA device, there is a constant trade-off between exploiting parallelism for extra performance and the resource cost associated with it. For Systolic Array Architectures, this trade-off is more straightforward, as the only tunable parallelism dimensions are those of the PE array. However, for Streaming Architectures such as FINN, HLS4ML and fpgaConvNet, the design space is much larger. Streaming Architectures tend to map each layer of the DNN model to an individual hardware block, which has its own tunable performance parameters. There is no straightforward way to search this large design space, so we provide a toolbox, samo, which uses existing optimisation solvers to address the problem.
Our tool solves the optimisation problem of getting the best performance out of a design whilst staying within resource limits. It removes the complicated and tedious task of tuning performance for a given platform-network pair. The rapid design space exploration performed by SAMO produces an optimised hardware configuration for the DNN model, leaving the designer free to focus solely on their application.
For FINN in particular, there are generally two parallelism dimensions for each layer: the input channel parallelism $\mathbf{s}^i$, which refers to the number of SIMD lanes for all convolution layers; and the output channel parallelism $\mathbf{s}^o$, which represents the number of PEs for the same convolution layers. Given the objective of minimising latency $L$ under a constraint on resources $\mathcal{R}$, we can define an optimisation problem over these parallelism values.
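A sketch of how such a problem can be written, assuming that latency and resource usage are both functions of the two parallelism vectors, and introducing $\mathcal{R}_{\max}$ (a symbol that is not in the original text) for the resource budget of the target platform:

$$
\min_{\mathbf{s}^i,\, \mathbf{s}^o} \; L(\mathbf{s}^i, \mathbf{s}^o) \quad \text{subject to} \quad \mathcal{R}(\mathbf{s}^i, \mathbf{s}^o) \leq \mathcal{R}_{\max}
$$

This is only one plausible form; the exact formulation used by the tool is the one given in the paper and the code.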
We also introduce further constraints on the parallelism values, such as requiring them to be factors of the channel dimension. Details can be found in our paper as well as in the code.
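To make the factor constraint concrete, here is a minimal sketch that lists the parallelism values it permits for one layer. The function name `valid_parallelism` and the channel counts are hypothetical and chosen for illustration; this is not code from the samo repository.

```python
def valid_parallelism(channels):
    """Return every factor of `channels`, in ascending order.

    Under the factor constraint, these are the only SIMD (or PE)
    values allowed for a layer with this many channels.
    """
    if channels < 1:
        raise ValueError("channels must be a positive integer")
    return [f for f in range(1, channels + 1) if channels % f == 0]


# Hypothetical convolution layer: 64 input channels, 128 output channels.
simd_options = valid_parallelism(64)   # candidate SIMD values
pe_options = valid_parallelism(128)    # candidate PE values
print(simd_options)
print(pe_options)
```

Even with this constraint, the number of combinations grows multiplicatively with the number of layers, which is why an automated search is useful.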
Setup
Currently, samo is executed within the FINN docker image. The first step is to clone the samo project:
Then clone the fork of FINN which is compatible with the samo tool. You may want to merge in the version of FINN you are currently using.
Finally, set SAMO_DIR to the path of the downloaded samo repo in your run-docker.sh, before entering the docker.
Usage
The original FINN compiler contains many transformation passes that modify the ONNX representation all the way down to hardware. SAMO is integrated into this transformation flow by pausing the compiler at the pass where a "dataflow partition" is generated.
The transformation passes before the "dataflow partition" stage are referred to as "pre_optimiser_steps" and they produce the "${network}_pre_optimiser.onnx" for optimisation.
SAMO then takes over the optimisation of the FINN-ONNX model, performing the Design Space Exploration and setting the appropriate SIMD and PE numbers. SAMO exports the optimised FINN-ONNX model as "${network}_post_optimiser.onnx".
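The kind of search performed at this stage can be illustrated with a toy brute-force exploration. Everything in this sketch is an assumption made for illustration: the layer sizes, the cost models (`latency`, `resource`), treating total latency as the sum of per-layer cycle counts, and the exhaustive search itself. It is not SAMO's solver and does not use the samo or FINN APIs.

```python
from itertools import product


def factors(n):
    """All factors of n: the allowed SIMD or PE values for n channels."""
    return [f for f in range(1, n + 1) if n % f == 0]


# Hypothetical layers: (input channels, output channels, work per channel pair).
LAYERS = [(16, 32, 1024), (32, 64, 256)]


def latency(layer, simd, pe):
    """Toy cycle count: work is divided across SIMD lanes and PEs."""
    c_in, c_out, work = layer
    return work * (c_in // simd) * (c_out // pe)


def resource(simd, pe):
    """Toy resource cost: one multiplier per SIMD lane per PE."""
    return simd * pe


def explore(layers, budget):
    """Return (latency, resource, config) with the lowest total latency
    among configurations that fit in `budget`, or None if nothing fits."""
    per_layer = [list(product(factors(ci), factors(co))) for ci, co, _ in layers]
    best = None
    for config in product(*per_layer):
        used = sum(resource(s, p) for s, p in config)
        if used > budget:
            continue  # violates the resource constraint
        total = sum(latency(l, s, p) for l, (s, p) in zip(layers, config))
        # Prefer lower latency; break ties with lower resource usage.
        if best is None or (total, used) < (best[0], best[1]):
            best = (total, used, config)
    return best


print(explore(LAYERS, budget=256))
```

Each entry of the returned `config` is a (SIMD, PE) pair for one layer. Exhaustive search is only feasible for very small networks, which is the motivation for using dedicated optimisation solvers on real designs.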
Finally, the following command is used to resume the compilation of FINN and generate the hardware.
Compatibility
Using the provided fork of FINN is not mandatory if you would like to try SAMO on the latest version of FINN. All you need to do is mount the SAMO folder in docker, break the FINN compilation after the "dataflow partition" pass, and feed the corresponding ONNX files into SAMO.
Citation
Please feel free to ask any questions about the tool, or how to use it!