I am working on building an accelerator for modulation classification. I have an 8-bit model trained in Brevitas that classifies an I/Q signal of 1024 samples into one of 12 modulation classes. The model has very acceptable performance, and after doing all the FINN transformations the Python/cppsim simulations show correct results. Running the accelerator on a KR260, however, the results are essentially garbage. I have added support for the KR260 in the source code and run example networks, which work correctly, so that is unlikely to be the issue.
My network consists of 1D convolutions and fully connected layers, with ReLU and max-pool. Nothing exotic, and the transformed network looks correct. I have applied quite limited folding, everything set to 1, with FIFOs of depth 128 between layers. The model is small and the weights are stored on-chip only; there are no runtime weights.
I have a few questions:
Is there some guidance on how to troubleshoot this?
Outside of making a full testbench and comparing layer-by-layer activations against the software sim, which is quite a time sink, I am quite lost on how to address this. Is there some way to do a neat self-test?
Do too-small FIFOs only lead to pipeline stalls for a simple feed-forward network, or can they cause overflows and garbage output that way?
This seems to indicate not: Questions about the FIFO depth between layers #383
Are there problems with the driver generation in general, or should I expect the folding and packing to possibly contain errors?
All the examples note that you should run the drivers with sudo, but this just does not work for me: pynq is not found and sourcing XRT is broken under sudo. Is this an issue? Running the example project via the correct PYNQ venv with XRT sourced correctly, or alternatively running the OS command through a JupyterLab notebook, works fine.
FINN silently removes unsupported layers at the input and output ends of the graph and expects you to implement those in software as pre-/post-processing. Often this will be the input quantization (if it is part of the model) and a floating-point scale multiplication at the output. Maybe the same is happening here and it is not caught by simulation, because simulation still executes the entire graph. You could check which layers actually get converted to HW CustomOps.
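If it helps, a minimal sketch of that check (the checkpoint filename is an assumption; if you run the transformations manually, use your in-memory model instead):
# Sketch: list which nodes ended up as FINN HW CustomOps
from qonnx.core.modelwrapper import ModelWrapper
model = ModelWrapper("intermediate_models/step_specialize_layers.onnx")  # assumed checkpoint name
for node in model.graph.node:
    is_hw = node.domain.startswith("finn.custom_op.fpgadataflow")
    print(node.op_type, "->", "HW" if is_hw else "NOT converted to HW")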
Too-small FIFO sizes can only cause a throughput drop and, in the worst case, a deadlock. According to this comment you should always have at least a depth-2 FIFO between (RTL) layers, though.
The folding/packing should work, at least for integer and float types. It can be extremely slow though and there are open PRs to improve it.
You are right, but this should not be an issue for accelerator performance or functional correctness. If not running via Jupyter, I launch PYNQ drivers like this:
# Run as root and activate the PYNQ venv manually to use PYNQ outside of the typical Jupyter environment
sudo bash -c "source /etc/profile.d/pynq_venv.sh && export XILINX_XRT=/usr && python driver.py"
Forgive my basic understanding of ML, as I am quite a beginner; I have a couple of follow-up questions.
First, some info. I train my model via Brevitas with input data generated synthetically in MATLAB and saved as single-precision floats. My aim is to format this data so that it resembles how actual data would arrive from an ADC. For several reasons 8 bits of precision are enough, so I want my input to be 8 bits. I can assume the data coming from the ADC is appropriately scaled in signed Q1.7 format, i.e. [-1, 1-2^(-7)]. Of course, this format is equivalent to an 8-bit signed integer with a weight or scale factor of 1/128, i.e. the 8-bit patterns are the same. Thus I generate two sets of data: one in float, in the aforementioned format, for training, and one saved as the appropriately scaled 8-bit integers. I train the model on the floating-point data without any initial quantization. Then in my FINN model I introduce a Div-by-128 node as the initial step and provide the INT8 data as input instead of the float values PyTorch needs. In simulation this yields identical results. It would not make much sense in practice to do this preprocessing in software.
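To illustrate the equivalence being relied on here, a minimal numpy sketch (array names and shapes are made up):
import numpy as np
# Q1.7: the int8 bit pattern x represents the real value x / 128
iq_int8 = np.random.randint(-128, 128, size=(1, 2, 1024), dtype=np.int8)  # hypothetical I/Q layout
iq_float = iq_int8.astype(np.float32) / 128.0  # what the Brevitas model is trained on
# the bit patterns are identical, so Div-by-128 on the int8 input reproduces the float input
recovered = np.round(iq_float * 128.0).astype(np.int8)
assert np.array_equal(recovered, iq_int8)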
Questions
4. Is the above an appropriate way to do this? In cnv_end2end_example a Div by 255 is used to bound the input values to [0, 1], which should be quite similar. The streamlining transformation just incorporates this division into a scaling of the thresholds of the first layer, so it is never actually performed explicitly anyway?
Should I really be using a VVAU instead of an MVAU?
"You could check which layers actually get converted to HW CustomOps." Not sure what you mean. Just by inspection of the Netron, the only nodes present in the folded configuration that is called ZynqBuild on are:
FMPadding_rtl
ConvolutionInputGenerator_rtl
MVAU_hls
StreamingMaxPool_hls
MVAU_rtl
LabelSelect_hls
All of which should be standard library stuff? Running cppsim on this "final" folded ONNX model still yields the expected results. A picture of the dataflow network topology is attached below.
Do FIFOs get included in cppsim? Or what things do get omitted?
How bad of a design choice is running thresholding with 8-bit weights and 4-bit activations with a kernel size of at most 13? Storing 15 thresholds of at most 16 bits doesn't seem entirely unreasonable, although across every channel and MVAU it does add up.
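For reference, a quick back-of-the-envelope on the threshold storage (the channel count is a made-up example, not taken from the model above):
# 4-bit activations -> 2**4 - 1 = 15 thresholds per output channel
n_thresh = 2**4 - 1
thresh_bits = 16
channels = 64  # hypothetical layer width
total_bits = channels * n_thresh * thresh_bits
print(total_bits / 8 / 1024, "KiB of thresholds for this layer")  # ~1.9 KiB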
Intuitively it sounds right to me, at least if FINN still recognizes the INT8 input such that the first layer operates on an INT8 input and not a float input.
VVAU is only for depthwise convolutions. For normal 1D convolutions you should first apply Change3DTo4DTensors (which it seems you did) and then use the MVAU normally.
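A minimal sketch of that step, assuming a streamlined model checkpoint (on older FINN versions the import may live under finn.transformation instead of qonnx.transformation):
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.transformation.change_3d_tensors_to_4d import Change3DTo4DTensors
model = ModelWrapper("model_streamlined.onnx")  # hypothetical filename
model = model.transform(Change3DTo4DTensors())  # 1D convs get a dummy spatial dim so the 2D flow applies
model.save("model_4d.onnx")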
step_create_dataflow_partition will create a "/intermediate_models/dataflow_parent.onnx". Everything not inside the StreamingDataflowPartition node (which is a container for "/intermediate_models/supported_op_partitions/partition_0.onnx") will not be implemented by the accelerator. Often there will be at least an input Transpose node left, meaning you will need to convert from NCHW data layout (training) to NHWC (FINN) yourself before passing data to the accelerator.
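A quick way to see what stayed outside the accelerator (the path assumes the usual intermediate_models layout; with manual transformations, point it at wherever you saved the parent model):
from qonnx.core.modelwrapper import ModelWrapper
parent = ModelWrapper("intermediate_models/dataflow_parent.onnx")
for node in parent.graph.node:
    if node.op_type != "StreamingDataflowPartition":
        print("left in software:", node.op_type, node.name)
# a leftover Transpose here means you have to convert NCHW -> NHWC yourself, e.g.
# data_nhwc = data_nchw.transpose(0, 2, 3, 1)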
cppsim is always node-by-node, so it shouldn't matter that it doesn't include FIFOs. The only simulation containing FIFOs would be stitched-ip rtlsim.
Activations of <=4 bits are the sweet spot for the FINN thresholding paradigm. 8-bit weights could make the MVAUs (both weight storage and compute logic) quite large, though, and I would recommend using the RTL (DSP) MVAU implementation for this bit width (Multi-packed DSPs for MVU/VVU layers #1021).
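Since you are doing the transformations manually, requesting the RTL variant could look roughly like this before SpecializeLayers (the op_type and the preferred_impl_style attribute are what newer FINN versions use; treat this as a sketch, not gospel):
from qonnx.custom_op.registry import getCustomOp
for node in model.graph.node:
    if node.op_type == "MVAU":  # "MatrixVectorActivation" on older FINN versions
        inst = getCustomOp(node)
        inst.set_nodeattr("preferred_impl_style", "rtl")  # ask SpecializeLayers for the RTL/DSP MVAU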
I am doing the transformations manually via a notebook and not via a build script. The image above shows the dataflow partition that has the folding configured and that ZynqBuild is called on, which is also the dataflow partition used for cppsim. I have simulated both the parent graph with cppsim and the above dataflow partition with the input appropriately transposed to NHWC format and packaged into an input dict: input_hw_sim = {"Transpose_0_out0": data_hw_sim.astype(np.int8)}. Both yield the same, expected result.
Saving the data_hw_sim numpy array to file and providing it to the driver, or setting everything up manually and calling the execute function, does not produce the expected classifications on the hardware. I have inspected the intermediate activations from execute_onnx_and_make_model, and in cppsim both the parent and dataflow activations are the same.
I am thinking the problem perhaps lies in how the data is written to memory from the numpy array, or the format of the input provided to the driver is somehow incorrect, or the hardware is somehow not generated as expected for the INT8 input. I have set model.set_tensor_datatype(global_inp_name, DataType["INT8"]) as part of the transformation steps, and again, cppsim does seem to work with INT8 inputs.
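One thing worth double-checking is that the file handed to the driver really is the NHWC int8 array with the shape the generated driver expects (if I recall correctly, the generated driver.py carries this in its io_shape_dict, but verify against your copy). A sketch, with data_nchw standing in for whatever the test data loads as:
import numpy as np
data_hw = data_nchw.transpose(0, 2, 3, 1).astype(np.int8)  # NCHW -> NHWC, done in software
np.save("input.npy", data_hw)
print(data_hw.shape, data_hw.dtype)  # compare against the expected input shape/datatype in driver.py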
Is the next step just to do rtlsim and try to verify activation values against the simulation results, or what are some steps to troubleshoot this further?
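In case it is useful, stitched-IP rtlsim (the only simulation that actually includes the FIFOs) can be run roughly like this, following the FINN end-to-end verification notebook; the FPGA part string for the KR260's K26 SOM is my assumption, and the exact set of transformations depends on your FINN version:
from qonnx.transformation.general import GiveUniqueNodeNames
from finn.transformation.fpgadataflow.prepare_ip import PrepareIP
from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP
from finn.transformation.fpgadataflow.create_stitched_ip import CreateStitchedIP
from finn.core.onnx_exec import execute_onnx
fpga_part = "xck26-sfvc784-2LV-c"  # assumed K26 SOM part
clk_ns = 10.0
model = model.transform(GiveUniqueNodeNames())
model = model.transform(PrepareIP(fpga_part, clk_ns))
model = model.transform(HLSSynthIP())
model = model.transform(CreateStitchedIP(fpga_part, clk_ns))
model.set_metadata_prop("exec_mode", "rtlsim")
out = execute_onnx(model, input_hw_sim)  # same input dict as for cppsim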