Conversation

@andcarminati
Collaborator

This PR achieves optimal pipelined loops for several important kernels, but with some degradation on the other side. The next question is: how do we decide when to enable it? It should be on a per-loop basis.

if (!Reg.isVirtual())
continue;

auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
Collaborator Author

Move next to use?

IsHighLat = true;
}

if (IsHighLat) {
Collaborator Author

Early return?

Collaborator

perhaps even return <closed expression>
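
For illustration, a minimal sketch of that closed-expression form, assuming the check sits in a small helper or lambda next to the snippets above (the surrounding context is an assumption, not the actual patch):

// Sketch only: collapses the IsHighLat flag into a single returned expression.
// MI, SchedClass, I and MinRegisterLatency come from the enclosing loop.
auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
return OperandCycle.has_value() ? OperandCycle.value() >= MinRegisterLatency
                                : MI.mayLoad();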

@andcarminati
Collaborator Author

With the new cutoff commit (Core_Insn_Count chart):

Improvement: 1.11% avg.

@krishnamtibrewala
Collaborator

> With the new cutoff commit (Core_Insn_Count chart):
> Improvement: 1.11% avg.

Curious to know how the cut-off was decided, and whether it led to a reduction in perf on benchmarks where we were getting better performance without the cut-off logic.

@andcarminati
Collaborator Author

> With the new cutoff commit (Core_Insn_Count chart):
> Improvement: 1.11% avg.
>
> Curious to know how the cut-off was decided, and whether it led to a reduction in perf on benchmarks where we were getting better performance without the cut-off logic.

The original goal of this PR was Conv2D_DW_bf16, but I saw that it also affected Gemm. For that reason, I decided to keep both kernels on the heuristic's radar. This whole work, including the cutoff heuristic, was based on assembly code analysis.


// Only consider to front physical registers that also
// used by high latency operands.
if (HighLatencyRegs.count(AssignedPhysReg))
Collaborator Author

As you can see here, we reverse the MO ordering by pushing to the front. This is not by mistake; it helps a lot!

Collaborator

add this as a source code comment

Collaborator

CHECK: We push them to the front so that they get replaced first?

Collaborator

nit: Rewrite the comment to
High latency registers should be renamed first, therefore insert them at the front

Collaborator

Maybe also rewrite for clarity:

const auto InsertPoint = HighLatencyRegs.count(AssignedPhysReg) ? Candidates.begin() : Candidates.end();
Candidates.emplace(InsertPoint, &MO, AssignedPhysReg);

}
}

// This metric gives us an idea about the "demand" for high latency registers
Collaborator Author

Ratio regs/instructions.

MCRegister AssignedPhysReg = VRM->getPhys(Reg);
Candidates.emplace_back(&MO, AssignedPhysReg);

// Only consider to front physical registers that also
Collaborator

nit: to put in front registers that are also used

getLastVRegDef(const MachineBasicBlock &MBB) const;

std::set<MCRegister>
getHighOutputLatencyRegs(const MachineBasicBlock *MBB) const;
Collaborator

nit: add a comment.
Is this the total live range of a physical register, or a single Live Interval?

Collaborator

are we looking here at the physical register mapped to a single virtual register def with a high output latency?

Collaborator Author

One MCRegister can also be mapped to more than one VReg (multiple defs); it is included provided that at least one of them is high latency.
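
To illustrate the point, a sketch of how the set could be built, assuming it walks the block's defs and the VirtRegMap (the helper isHighLatencyDef is hypothetical and only keeps the example short):

std::set<MCRegister> HighLatRegisters;
for (const MachineInstr &MI : *MBB)
  for (const MachineOperand &MO : MI.defs())
    // Several virtual register defs may map to the same physical register;
    // it ends up in the set as soon as one of those defs is high latency.
    if (MO.getReg().isVirtual() && isHighLatencyDef(MI, MO)) // hypothetical helper
      HighLatRegisters.insert(VRM->getPhys(MO.getReg()));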


bool IsHighLat = false;
if (OperandCycle.has_value()) {
if (OperandCycle.value() >= MinRegisterLatency)
IsHighLat = true;
Collaborator

nit: add brackets here to differentiate from the !OperandCycle.has_value() case

if (OperandCycle.has_value()) {
if (OperandCycle.value() >= MinRegisterLatency)
IsHighLat = true;
} else if (MI.mayLoad()) {
Collaborator

nit: don't we always have IsHighLat==true when we encounter a load instruction?
I think we could simplify the conditions to:

auto SetIsHighLat = [&]() {
  if (MI.mayLoad())
    return true;

  auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
  if (!OperandCycle)
    return false;

  return OperandCycle.value() >= MinRegisterLatency;
};

Collaborator Author

Nice suggestion. I will just trust the itinerary first, before the load check, because that way we insert fewer registers into the set; otherwise the extra registers could bias the cutoff.
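
A sketch of that ordering, reusing the IsHighLatInstrOperand name that appears later in the patch and assuming the lambda captures the enclosing loop state (an approximation, not the exact patch code):

auto IsHighLatInstrOperand = [&]() {
  // Trust the itinerary first; fall back to mayLoad() only when no operand
  // cycle is modelled, so fewer registers enter the set and the cutoff
  // stays unbiased.
  auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
  if (OperandCycle.has_value())
    return OperandCycle.value() >= MinRegisterLatency;
  return MI.mayLoad();
};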


auto *ItinData = MF->getSubtarget().getInstrItineraryData();
std::set<MCRegister> HighLatRegisters;
unsigned NInstrs = 0;
Collaborator

nit: why don't we query MBB.size()?

// latency-aware heuristic.
// TODO: this should be replaced by more stable metrics related to SWP.
if (!HighLatRegisters.empty() &&
(((HighLatRegisters.size() * 100 / NInstrs)) < 250 /*calibrated value*/))
Collaborator

nit: for readability add a nice name for (HighLatRegisters.size() * 100 / NInstrs). Something like: HighLatencyRegisterInstrRatio

Collaborator
@martien-de-jong Oct 22, 2025

HighLatencyRegisterCountToInstrCountRatioInPercent
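
With such a name, the cutoff check from the snippet above would read roughly as follows (a sketch; the guard against an empty block and the elided enabled path are assumptions):

// Sketch only; the enabled path itself is elided.
const unsigned HighLatencyRegisterCountToInstrCountRatioInPercent =
    NInstrs ? HighLatRegisters.size() * 100 / NInstrs : 0;
if (!HighLatRegisters.empty() &&
    HighLatencyRegisterCountToInstrCountRatioInPercent < 250 /*calibrated value*/)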

Collaborator Author
@andcarminati Oct 24, 2025

I changed the name according to @F-Stuckmann's suggestion and extended the comment to accommodate @martien-de-jong's observation.

cl::init(3));

static cl::opt<bool>
LatencyAware("aie-realloc-latencyaware", cl::Hidden, cl::init(false),
Collaborator

nit: in the last commit set the option to true, since with the cutoff it is now profitable.
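
That is, only the default in the quoted declaration flips (a sketch; the description argument is not shown in the hunk above and is left out here as well):

static cl::opt<bool>
    LatencyAware("aie-realloc-latencyaware", cl::Hidden,
                 cl::init(true)); // remaining arguments as in the original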

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch 2 times, most recently from 65f77fc to 63bbf0d on October 24, 2025 07:25

if (!Reg.isVirtual())
continue;

auto IsHighLatInstrOperand = [&]() {
Collaborator

Out of curiosity: why is this lambda not outside of the for loops?

Collaborator Author

In this case, my goal was to have a lazily evaluated boolean where we can state the conditions very explicitly. It is defined inside the loops so we can capture everything instead of passing parameters, which keeps the if statement clean.

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 63bbf0d to 6da3b15 on October 24, 2025 12:29
; CHECK-NEXT: vlda.conv.fp32.bf16 cml4, [p0], m1
; CHECK-NEXT: vlda.conv.fp32.bf16 cml2, [p1], m0
; CHECK-NEXT: vlda.conv.fp32.bf16 cml0, [p1], m0; movx r1, #60
; CHECK-NEXT: vlda.conv.fp32.bf16 cml1, [p0], m1; add.nc lc, r0, #-3; vadd.f dm3, dm0, dm1, r1
Collaborator Author

We can reach one more stage now.

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 6da3b15 to e213349 on October 24, 2025 13:46
@andcarminati
Collaborator Author

andcarminati commented Oct 24, 2025

Complete QoR view:

(Charts: Core_Insn_Count, Core_StackSize_absolute, Core_PMSize_absolute)

(PM increases when we increase the number of SWP stages.)

…istic

This commit adds an end-to-end test that fails to be scheduled optimally (post-SWP) because
of the actual register allocation. This regression is introduced by the latency-aware
WAWRegRewriter strategy.
Ideally, we should extend this with SWP metrics.
@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 401e86b to 30888fc on October 24, 2025 14:59
@F-Stuckmann
Collaborator

What happens with stack and PM?

@andcarminati
Collaborator Author

> What happens with stack and PM?

I added them to the previous comment.
