-
Couldn't load subscription status.
- Fork 29
[AIEX] Add a latency-aware heuristic component to AIEWAWRegRewriter #682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: aie-public
Are you sure you want to change the base?
Conversation
| if (!Reg.isVirtual()) | ||
| continue; | ||
|
|
||
| auto OperandCycle = ItinData->getOperandCycle(SchedClass, I); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move next to use?
| IsHighLat = true; | ||
| } | ||
|
|
||
| if (IsHighLat) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps even return <closed expression>
The original goal of this PR was Conv2D_DW_bf16, but I saw that it also affected Gemm. In this way, I decided to keep those kernels on the heuristic radar. This whole work was based on assembly code analysis (including the cutoff heuristic). |
|
|
||
| // Only consider to front physical registers that also | ||
| // used by high latency operands. | ||
| if (HighLatencyRegs.count(AssignedPhysReg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you see here, we reverse the MO ordering by pushing front. This is not by mistake, this helps a lot!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this as a source code comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHECK: We push them to the front so that they get replaced first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rewrite the comment to
High latency registers should be renamed first, therefore insert them at the front
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also rewrite for clarity:
const auto InsertPoint = HighLatencyRegs.count(AssignedPhysReg) ? Candidates.begin() : Candidates.end();
Candidates.emplace(InsertPoint , &MO, AssignedPhysReg);
| } | ||
| } | ||
|
|
||
| // This metric gives us an idea about the "demand" for high latency registers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ratio regs/instructions.
| MCRegister AssignedPhysReg = VRM->getPhys(Reg); | ||
| Candidates.emplace_back(&MO, AssignedPhysReg); | ||
|
|
||
| // Only consider to front physical registers that also |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: to put in front registers that are also used
| getLastVRegDef(const MachineBasicBlock &MBB) const; | ||
|
|
||
| std::set<MCRegister> | ||
| getHighOutputLatencyRegs(const MachineBasicBlock *MBB) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a comment.
Is this the total live range of a physical register or on a single Live Interval?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we looking here at the physical register mapped to a single virtual register def with a high output latency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One MCRegister can also could be mapped to more than one VReg (defs), provided that at least one is high latency.
|
|
||
| // Only consider to front physical registers that also | ||
| // used by high latency operands. | ||
| if (HighLatencyRegs.count(AssignedPhysReg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add this as a source code comment
|
|
||
| // Only consider to front physical registers that also | ||
| // used by high latency operands. | ||
| if (HighLatencyRegs.count(AssignedPhysReg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CHECK: We push them to the front so that they get replaced first?
| bool IsHighLat = false; | ||
| if (OperandCycle.has_value()) { | ||
| if (OperandCycle.value() >= MinRegisterLatency) | ||
| IsHighLat = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add brackets here to differentiate between the !OperandCycle.has_value() case
| if (OperandCycle.has_value()) { | ||
| if (OperandCycle.value() >= MinRegisterLatency) | ||
| IsHighLat = true; | ||
| } else if (MI.mayLoad()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: don't we always have IsHighLat==true when we encounter a load instruction?
I think we could simplify the conditions to :
auto SetIsHighLat = [&ItinData, &MinRegisterLatency](const MachineInstr &) {
if (MI.mayLoad())
return true;
auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
if (!OperandCycle)
return false;
return OperandCycle.value() >= MinRegisterLatency;
};
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice suggestion, I will just trust the load itinerary before, because we insert less registers in the set and this could bias the cutoff.
|
|
||
| auto *ItinData = MF->getSubtarget().getInstrItineraryData(); | ||
| std::set<MCRegister> HighLatRegisters; | ||
| unsigned NInstrs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: why don't we querry MBB.size() ?
| // latency-aware heuristic. | ||
| // TODO: this should be replaced by more stable metrics related to SWP. | ||
| if (!HighLatRegisters.empty() && | ||
| (((HighLatRegisters.size() * 100 / NInstrs)) < 250 /*calibrated value*/)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for readability add a nice name for (HighLatRegisters.size() * 100 / NInstrs). Something like: HighLatencyRegisterInstrRatio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HighLatencyRegisterCountToInstrCountRatioInPercent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the name according @F-Stuckmann and extended the comment to accommodate @martien-de-jong observation.
| cl::init(3)); | ||
|
|
||
| static cl::opt<bool> | ||
| LatencyAware("aie-realloc-latencyaware", cl::Hidden, cl::init(false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: in the last commit set the option to true, since with the cutoff it is now profitable.
65f77fc to
63bbf0d
Compare
| getLastVRegDef(const MachineBasicBlock &MBB) const; | ||
|
|
||
| std::set<MCRegister> | ||
| getHighOutputLatencyRegs(const MachineBasicBlock *MBB) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we looking here at the physical register mapped to a single virtual register def with a high output latency?
|
|
||
| // Only consider to front physical registers that also | ||
| // used by high latency operands. | ||
| if (HighLatencyRegs.count(AssignedPhysReg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rewrite the comment to
High latency registers should be renamed first, therefore insert them at the front
|
|
||
| // Only consider to front physical registers that also | ||
| // used by high latency operands. | ||
| if (HighLatencyRegs.count(AssignedPhysReg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also rewrite for clarity:
const auto InsertPoint = HighLatencyRegs.count(AssignedPhysReg) ? Candidates.begin() : Candidates.end();
Candidates.emplace(InsertPoint , &MO, AssignedPhysReg);
| if (!Reg.isVirtual()) | ||
| continue; | ||
|
|
||
| auto IsHighLatInstrOperand = [&]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiosity. Why is this lambda not outside of the for loops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, my goal was to have a lazy evaluation boolean where we can state the conditions very explicitly. It is inside so we can capture everything instead of using parameters (clean if statement).
63bbf0d to
6da3b15
Compare
| ; CHECK-NEXT: vlda.conv.fp32.bf16 cml4, [p0], m1 | ||
| ; CHECK-NEXT: vlda.conv.fp32.bf16 cml2, [p1], m0 | ||
| ; CHECK-NEXT: vlda.conv.fp32.bf16 cml0, [p1], m0; movx r1, #60 | ||
| ; CHECK-NEXT: vlda.conv.fp32.bf16 cml1, [p0], m1; add.nc lc, r0, #-3; vadd.f dm3, dm0, dm1, r1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can reach one more stage now.
6da3b15 to
e213349
Compare
…istic This commit adds an end-to-end test that fails to be scheduled (post-swp) optimally because of the actual register allocation. This regression is introduced by the latency-aware WAWRegRewriter strategy.
Ideally, we should extend with SWP metrics.
401e86b to
30888fc
Compare
|
What happens with stack and PM ? |
I added to the previous comment. |





This PR brings optimality in terms of pipelined loops for several important kernels, however with some degradation on the other side. Que next question is, how to decide when we can enable it? It should be on a per-loop basis.