[STA] Post-Implementation STA Support for Dedicated Clock Network Modeling #3027


Open
AlexandreSinger opened this issue May 7, 2025 · 7 comments


@AlexandreSinger
Contributor

AlexandreSinger commented May 7, 2025

We currently have an error condition when generating the post-implementation SDC file for STA:

```cpp
// Ideal and routed clocks are handled by the code below. Other clock models
// like dedicated routing are not supported yet.
// TODO: Supporting dedicated routing should be simple; however it should
//       be investigated. Tried quickly but found that the delays produced
//       were off by 0.003 ns. Need to investigate why.
if (clock_modeling != e_clock_modeling::ROUTED_CLOCK && clock_modeling != e_clock_modeling::IDEAL_CLOCK) {
    VPR_FATAL_ERROR(VPR_ERROR_IMPL_NETLIST_WRITER,
                    "Only ideal and routed clock modeling are currently "
                    "supported for post-implementation SDC file generation");
}
```

See PR #3016 for context.

This exists because the critical path delay reported by Tatum did not match the critical path delay reported by OpenSTA. I took the CLMA example from the OpenSTA with VTR tutorial and put it on the k6_frac_N10_frac_chain_mem32K_htree0_40nm.xml architecture (and changed the clock model accordingly):

Raw reports:

report_timing.setup.txt

open_sta_report_timing.setup.txt

Looking at the critical paths:

Tatum:

[Image: Tatum critical path timing report]

OpenSTA:

[Image: OpenSTA critical path timing report]

(Note: the arrival path was identical for both reports)

Notice that the clock network delays are off by 0.003 ns. The reason this very small discrepancy caused me to block the feature is that all timing paths are off by 0.003 ns. This leads me to think something is wrong with how the SDF files are interpreted by OpenSTA vs. Tatum.

Looking at the SDF delay annotation for the interconnect going into the clock of latch_ni33 (termination of the critical path of the design), we see the following:

[Image: SDF delay annotation for the interconnect driving the clock pin of latch_ni33]

For some reason OpenSTA is choosing to use both the max and min path delays for the clock network delay, while Tatum appears to be using only the max path delays. Since the difference between the max and min delays on the interconnect is 0.003 ns, that is the source of the discrepancy.

Based on what I am seeing, Tatum seems to be using the max delays on the clock interconnect for both the launching and capturing registers, while OpenSTA uses the max delay for the launching register and the min delay for the capturing register.

@vaughnbetz What do you think could have caused this? I have actually run into this discrepancy before, when the input pins have different max and min delays for clocks; however, I thought it was an architecture file issue. Looking at these results, though, it makes sense. Shouldn't we be using the min delay for the required time to be pessimistic, since we should assume the capturing FF captures at the earliest possible point?

@vaughnbetz
Contributor

vaughnbetz commented May 8, 2025

That sounds like a bug in Tatum; we should be using the minimum clock delay on the required path, as you say. @kmurray : if you've got some time tomorrow I can come by and chat about this ... I think somewhere we're taking a max when we should take a min on the required time path.

The path listings look pretty clear to me. I assume the min/max values you output in the SDF come from the same delay calculator Tatum is using, which would further narrow things down (to a Tatum bug, probably, rather than a delay annotation difference).

@AlexandreSinger
Contributor Author

AlexandreSinger commented May 8, 2025

@vaughnbetz I did have a quick discussion with @petergrossmann21 today. Another potential cause of this is what we discussed in the VTR meeting a few weeks ago. It could be that one wire cannot have a min delay while the other wire has a max delay at the same time; they must both be either max or min.

For example, suppose we had a single wire segment with a min delay and a max delay, which then split into two wires, each with only a single delay (a different one for each), going to the two registers we care about. The paths to each register would have max and min delays; however, they would either both take the max delay or both take the min delay.

When we generate the post-implementation netlist, delays are annotated onto generated fpga_interconnect instances that go from source ports to target ports. The post-implementation netlist, however, does not have any modules for the input ports, so the delays of the input ports combine with the wires. We can find our 0.003 ns delay difference in the architectural description of the IO block:

[Image: IO block architecture description containing the 0.003 ns min/max spread]

Here, the input pad is given a shared connection with a max/min delay, which then splits out to our registers.

For your reference here is a sample of the netlist being generated:

[Images: excerpts of the generated post-implementation netlist]

As far as I can tell, all routes from clocks to FFs follow this pattern.

The thing that confuses me is that we are getting these timing edges from Tatum, and Tatum is telling us (as far as I can tell) that the delays from the input pins to the clock inputs have different max and min delays, which implies to me that the input pad delays are combining with the routed wire delay in the Tatum timing graph. A somewhat simple solution would have been to create modules for the input and output pads and attach these min/max delays to the pads; but I am not sure the delays would annotate correctly.

This also opens a whole can of worms regarding fanout of nets during routing... I do not think a viable solution would be to create an fpga_interconnect instance for every routing wire segment...

@petergrossmann21
Contributor

@AlexandreSinger in the current fpga_interconnect generation, how does the generation work for arbitrary route trees? do you get 1 interconnect instance per route tree branch or one per startpoint/endpoint pair? (if this doesn't make sense, imagine a net with fanout of 4 where the complete route for the net looks like a binary tree. Do you get 4 fpga_interconnect instances for such a tree or 7?). Either solution should be viable; the only difference is the values of the delay parameters for each fpga_interconnect instance.

@AlexandreSinger
Contributor Author

> @AlexandreSinger in the current fpga_interconnect generation, how does the generation work for arbitrary route trees? do you get 1 interconnect instance per route tree branch or one per startpoint/endpoint pair? (if this doesn't make sense, imagine a net with fanout of 4 where the complete route for the net looks like a binary tree. Do you get 4 fpga_interconnect instances for such a tree or 7?). Either solution should be viable; the only difference is the values of the delay parameters for each fpga_interconnect instance.

I think the netlist generator code creates 1 fpga_interconnect instance per startpoint/endpoint pair. For example, the output signal of LUT "n_n1463" fans out to 6 sinks, so 6 fpga_interconnect instances are created, one for each of these connections:

[Image: netlist excerpt showing the 6 fpga_interconnect instances]

Each of these instances has SDF annotations for the routing delay from the src to its sink, independently:

[Image: SDF annotations giving each instance its own source-to-sink routing delay]
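In other words, for Peter's fanout-of-4 binary-tree example you would get 4 instances, not 7. A rough sketch of that per-(source, sink) instancing (names and the struct are illustrative, not the actual netlist writer):

```cpp
#include <string>
#include <vector>

// Hypothetical model of the per-connection instancing described above: a net
// with N sinks yields N fpga_interconnect instances, each annotated later
// with the full source-to-sink routing delay.
struct InterconnectInst {
    std::string name;
    std::string src;
    std::string sink;
};

std::vector<InterconnectInst> make_interconnects(
        const std::string& src, const std::vector<std::string>& sinks) {
    std::vector<InterconnectInst> insts;
    for (const auto& sink : sinks) {
        // One instance per startpoint/endpoint pair; tree branches are not
        // modeled as separate instances. Naming scheme is made up here.
        insts.push_back({"routing_segment_" + src + "_to_" + sink, src, sink});
    }
    return insts;
}
```

This is why shared routing (including shared clock branches) collapses into each source-to-sink delay rather than appearing as a common structural element.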

@vaughnbetz
Contributor

Ah, thanks for the extra detail Alex. This may be OK then. Having a clock path that shares components is a tricky analysis; timing analyzers often report the pessimism in the min/max clock spread that they can claw back as common clock pessimism removal. OpenSTA has CCPR in its timing report, but it listed it as 0.

Tatum appears to be treating the common part of the clock path (the IO block) as having to be either at its max or at its min, but not both, for these two paths. It is concluding that one chip cannot, at two very close points in time, have the same resource with two different delays, so it uses only one of the min/max delays. OpenSTA is using both. If you assume the min and max are due to crosstalk, what OpenSTA predicts could happen (these are two different clock edges and could be subject to different crosstalk), but if you assume it is process variation it cannot (either a wire or block is extra fast or extra slow, but not both). I think in Quartus we modeled crosstalk on clock networks as clock uncertainty/jitter, not as min/max on the clock network itself, so I think we were closer to what Tatum is doing (but at this level of detail I am not 100% sure).

@kmurray : let's chat tomorrow about this. Maybe we can document the Tatum assumptions somewhere in any case.

@petergrossmann21 : Alex is right that we just output an interconnect instance to model the whole routing connection. That means you will get minimal structural clock modeling where the min and max spread can be collapsed.

@AlexandreSinger
Contributor Author

FYI, I also found that this is not localized to dedicated clock network modeling. It looks like any architecture that has max and min delays on its input pads has this discrepancy. I just tried it on the tutorial demo with increased precision and saw that it too was off by 0.003 ns (since the architecture in the tutorial was based on the dedicated clock network architecture I tested). It looks like you were right, Vaughn, that dedicated clock networks should work through this flow the same as the other clock models.

@vaughnbetz
Copy link
Contributor

vaughnbetz commented May 8, 2025 via email
