Parallelize Router Preprocessing #1316
base: 2025.2.0
Conversation
clavin-xlnx left a comment
Thanks for improving this! Looks like there might be some regression failures you'll need to sort out.
This is passing test cases locally but is now dependent on an update to the RapidWright API.
OK, I think this should be in a reasonable place now. There's definitely more performance that could be squeezed out by adding thread-safe APIs to Design and Net instead of the coarse-grained synchronization I have now, but experimentally that seems to add only a few seconds of overhead at most. Combined with a few additional optimizations, the total runtime is now around 65-70s (~3.5x speedup).
clavin-xlnx left a comment
I've looked over the code and don't have any major comments, but I will defer to @eddieh-xlnx, as I think he can provide more substantial feedback on the parallelization techniques.
    Cell c = cellCache.containsKey(p) ? cellCache.get(p) : design.getCell(p.getFullHierarchicalInstName());
    if (!cellCache.containsKey(p)) {
        cellCache.put(p, c);
    }
Look at using computeIfAbsent here. Otherwise, you're doing 4 lookups.
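For reference, a minimal sketch of the computeIfAbsent form being suggested (names taken from the diff above; note that, unlike the original put, computeIfAbsent will not cache a null result):

    // One map lookup instead of four; the lambda runs only on a cache miss.
    Cell c = cellCache.computeIfAbsent(p,
            k -> design.getCell(k.getFullHierarchicalInstName()));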
Actually, can you convince me that Map<EDIFHierPortInst, Cell> cellCache adds anything here? Since you're iterating over physPins, is it expected that you will ever look up one of those pins more than once?
I could maybe see value in a Map<EDIFHierCellInst, Cell>, but even then, how often does the same pin connect to the same cell? I'd need to see some data.
      String logicalPinName = p.getPortInst().getName();
    - Set<String> physPinMappings = c.getAllPhysicalPinMappings(logicalPinName);
    + Set<String> physPinMappings;
    + synchronized (design.getCell(c.getName())) {
- Why do you need to synchronize on the cell here? I don't think any pin mappings need to be modified in this method, nor do cells need to be created.
- Why is it necessary to look up the cell again here?
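In other words, if this method only reads pin mappings, the pre-change form should suffice, reusing the Cell c already in hand:

    // Sketch: no lock and no second design.getCell() lookup needed
    // if getAllPhysicalPinMappings() is a read-only call here.
    String logicalPinName = p.getPortInst().getName();
    Set<String> physPinMappings = c.getAllPhysicalPinMappings(logicalPinName);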
    while (!futures.isEmpty()) {
        ParallelismTools.joinFirst(futures);
    }
joinFirst() is for when you want to get the result from the first Future that completes so that it can be operated on. You want this instead:
    - while (!futures.isEmpty()) {
    -     ParallelismTools.joinFirst(futures);
    - }
    + ParallelismTools.join(futures);
You'll probably need to change futures to a List too.
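To make that concrete, a hedged sketch of the resulting shape (the generic type and how the futures get populated are assumptions; the comment above implies join() accepts a List):

    List<Future<Void>> futures = new ArrayList<>();
    // ... futures populated by the submitted preprocessing tasks ...
    ParallelismTools.join(futures);  // blocks until every task completes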
     * @return
     */
    @SuppressWarnings("unchecked")
    public static Map<SiteInst, Map<Net, List<String>>> getSiteInstToNetSiteWiresMap(Design design) {
I'm not convinced this is necessary. My quick take is that this Map is only used on a per-Net basis (i.e. createMissingSitePinInsts(Design, Net, ...), getRoutedSitePinFromPhysicalPin(Cell, Net, ...), getAllRoutedSitePinsFromPhysicalPin(Cell, Net, ...)), which indicates to me that a global map (keyed on SiteInst) is not the best approach.
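To illustrate the alternative, a hypothetical per-Net helper (the name and body are illustrative only, not an existing RapidWright API):

    // Compute the site wires for one Net on demand instead of building
    // a global Map<SiteInst, Map<Net, List<String>>> up front.
    static Map<SiteInst, List<String>> getNetSiteWires(Net net) {
        Map<SiteInst, List<String>> siteWires = new HashMap<>();
        // ... collect only the site wires used by this net ...
        return siteWires;
    }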
    if (parentPhysNet == null) {
        synchronized (design) {
            if (!net.rename(parentHierNet.getHierarchicalNetName())) {
Depending on how many things you have to rename (admittedly, there shouldn't be many), this might be quite slow, as you're essentially enforcing an atomic section on the design singleton. It's likely better to save everything you need to rename into a concurrent collection (e.g. ConcurrentLinkedQueue) so that the actual renaming can happen outside of the parallel section.
Same for the movePinsToNewNetDeleteOldNet() call below.
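A hedged sketch of that deferral pattern (the pair type and placement are illustrative, not the PR's actual code):

    // Collect renames during the parallel phase (thread-safe, lock-free).
    Queue<Map.Entry<Net, String>> pendingRenames = new ConcurrentLinkedQueue<>();

    // In the parallel section: record the rename instead of locking the design.
    pendingRenames.add(new AbstractMap.SimpleEntry<>(net,
            parentHierNet.getHierarchicalNetName()));

    // After all worker threads are joined: apply renames serially.
    for (Map.Entry<Net, String> entry : pendingRenames) {
        if (!entry.getKey().rename(entry.getValue())) {
            // handle the failure as the original code does
        }
    }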
For some large designs, router preprocessing takes longer than actually routing the design. This PR parallelizes the bottlenecks of router preprocessing where possible. Before these changes, preprocessing takes 243 seconds on a design that nearly fills 2 SLRs on the V80; after these changes, it takes 74 seconds (a 3.3x speedup). The performance comparison was run on my VDI with 8 threads.