Replies: 1 comment 3 replies
I am not sure about the implementation constraints (the RTL code); @preusser is probably most qualified to comment on those. However, this looks like a fairly obvious optimization to implement: do not generate per-channel parameter code when the parameters only have per-tensor granularity (and I think this should be possible somehow? Also, the referenced PR only applies to the binary case where numSteps = 1, or am I reading the code wrong?). From a purely experimental point of view (based on some experiments we did recently), I can tell you that this broadcasting is not as bad as it seems: if there is some redundancy or structure in the distribution of threshold values that can be exploited, synthesis (i.e., Vivado) usually picks it up. Even if code generation produces numChannels × numSteps threshold values, if they all turn out to be identical, Vivado should produce an optimized implementation with resource utilization much closer to 1 × numSteps.
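For illustration, here is a minimal numpy sketch of the kind of pre-codegen check described above. It is not part of the FINN API; the function name and the idea of collapsing the threshold matrix before code generation are assumptions, but it shows how one could detect that a (numChannels, numSteps) threshold matrix is effectively per-tensor and store only numSteps values:

```python
import numpy as np

def collapse_shared_thresholds(thresholds: np.ndarray) -> np.ndarray:
    """Collapse a (numChannels, numSteps) threshold matrix to (1, numSteps)
    when every channel uses the same threshold values.

    Hypothetical pre-processing step (not a FINN transformation) that could
    run before code generation so only numSteps values need to be stored
    instead of numChannels x numSteps.
    """
    # All rows identical -> thresholds are effectively per-tensor.
    if np.all(thresholds == thresholds[0:1, :]):
        return thresholds[0:1, :]  # shape (1, numSteps)
    return thresholds  # keep per-channel granularity

# Example matching the scenario below: 16 channels, 255 steps, identical rows.
T = np.tile(np.arange(255, dtype=np.float32), (16, 1))
print(T.shape)                              # (16, 255)
print(collapse_shared_thresholds(T).shape)  # (1, 255)
```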
If I had a Thresholding layer with parameters (numChannels = 16, numSteps = 255), but all the threshold values were identical across channels for the same step number (i.e., only numSteps distinct values in total), would it be possible—either in an HLS implementation or in an RTL implementation—to synthesize only numSteps values instead of numChannels × numSteps?
From #1002, I understand that this wouldn't be possible, as it mentions that the Thresholding RTL module expects a per-channel scale. However, I'm wondering if this may have changed in more recent updates—has this kind of shared-threshold optimization been supported since then?