MachineLICM to hoist instructions with constant inputs #220
Conversation
llvm/lib/CodeGen/MachineLICM.cpp
Outdated
```diff
@@ -356,7 +356,8 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
   MRI = &MF.getRegInfo();
   SchedModel.init(&ST);

-  PreRegAlloc = MRI->isSSA();
+  PreRegAlloc = !MF.getProperties().hasProperty(
+      MachineFunctionProperties::Property::NoVRegs);
```
Tbh one should never redefine `PreRegAlloc`; there's a reason why there is a `MachineLICM` and an `EarlyMachineLICM` pass. But they don't really matter, because `PreRegAlloc` is redefined anyway.
This diff is more of a band-aid to make MIR tests easy to write, because the MIRParser considers MIR as SSA if it has absolutely no vreg, which is unfortunate.
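To make that pitfall concrete, here is a minimal sketch; the helper name and the exact predicate are my illustration of the vacuous-truth problem, not the actual MIRParser code:

```cpp
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// If SSA-ness is inferred as "every virtual register has at most one
// definition", the predicate is vacuously true for fully
// register-allocated MIR, which has no vregs at all.
static bool looksLikeSSA(const MachineRegisterInfo &MRI) {
  for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
    Register Reg = Register::index2VirtReg(I);
    // A vreg with more than one definition breaks SSA form.
    if (!MRI.def_empty(Reg) && !MRI.hasOneDef(Reg))
      return false;
  }
  return true; // Zero vregs: trivially "SSA".
}
```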
Force-pushed from 21d4886 to 8d97183.
```diff
@@ -381,6 +383,26 @@ class PropagateIncomingLatencies : public ScheduleDAGMutation {
         }))
       continue;

+    // Do not change the latency if the REG_SEQUENCE has one source
```
Humm, this is exactly the situation I had in mind when I looked at `MacroFusion`: place the `REG_SEQUENCE` near its user. Great!
```cpp
auto HasExternalAndLocalSources = [&MBB, &MRI](const MachineInstr &MI) {
  return MI.isRegSequence() && MRI.isSSA() && MI.getNumOperands() > 3 &&
         count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
           return MO.isReg() && MO.getReg().isVirtual() &&
```
check: as we eliminate REG_SEQUENCEs as part of the de-ssa process, do we really need to check `MRI.isSSA()` and `MO.getReg().isVirtual()`?
Probably not, I'm just careful because that DAGMutator could potentially be run at any moment, and in the middle of the "de-ssa process" we might still have REG_SEQUENCEs. But you're right, I'm probably way too cautious here :D
```cpp
count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
  return MO.isReg() && MO.getReg().isVirtual() &&
         MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
}) == 1;
```
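For readers piecing the snippets together, here is a consolidated sketch of the whole predicate as it appears across this review (an approximation combining the fragments quoted above and below, not the verbatim diff):

```cpp
// REG_SEQUENCE operands: one def, then (value, subreg-index) pairs, so
// getNumOperands() == 1 + 2 * NumSources.
auto HasExternalAndLocalSources = [&MBB, &MRI](const MachineInstr &MI) {
  if (!MI.isRegSequence() || !MRI.isSSA() || MI.getNumOperands() <= 3)
    return false;
  // Count source values whose defining instruction lives outside this block.
  const auto NumExternal =
      count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
        return MO.isReg() && MO.getReg().isVirtual() &&
               MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
      });
  // Everything else (values plus subreg indices) counts as "internal";
  // this is what the NumInternal question below is about.
  const auto NumInternal = MI.getNumOperands() - 1 - (2 * NumExternal);
  return NumExternal == 1 && NumInternal >= 1;
};
```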
For some benchmarks, we can have something like this:

```
%529:eds = REG_SEQUENCE %47, %subreg.sub_mod, %50, %subreg.sub_dim_size, %53, %subreg.sub_dim_stride, %581, %subreg.sub_dim_count, %56, %subreg.sub_hi_dim_then_sub_dim_size, %59, %subreg.sub_hi_dim_then_sub_dim_stride, %582, %subreg.sub_hi_dim_then_sub_dim_count
%530:acc512, %531:ep, %532:edc, %533:edc = VLDA_3D_CONV_FP32_BF16 %579, %529 :: (load unknown-size from %ir.in_ptr2.0, align 32, !tbaa !4, !noalias !1655, addrspace 6)
```

The first 3 registers come from outside, so this comparison `== 1` will fail. However, for `Add2D_bf16_1` this is a nice thing considering the final result.
In the case of `Mul2d_b16_0`, this mutation leads to the opposite effect: the REG_SEQUENCE outputs end up as loop-carried (LC) deps. If we, on the other hand, disable this mutation as a whole, we just have the lanes as LC deps and the MLICM can nicely hoist them.
With the mutation we have:

```
nopb ; vlda wl2, [sp, #-192]; nops ; nopxm ; nopv // 32-byte Folded Reload
vldb wl2, [p0, #96]
vmov wh2, wl0
vlda wl7, [p0, #64]; vldb wl10, [p1, #32]
vst wh2, [sp, #-96] // 32-byte Folded Spill
vlda wl9, [p1, #64]; vldb wl2, [p0, #32]
vst wh2, [sp, #-160]; vmov wh4, wl0 // 32-byte Folded Spill
vst wh2, [sp, #-32]; vmov wh10, wl0; vmul.f bmh5, x2, x8, r6 // 32-byte Folded Spill
vst wl2, [sp, #-128]; mov p4, p3; vmul.f bmh3, x3, x5, r6 // 32-byte Folded Spill
vlda.3d wl3, [p0], d0; vldb wl2, [p1, #96]; vst.conv.bf16.fp32 bmh0, [p4], #64; vmul.f bmh4, x4, x10, r6
vldb.3d wl5, [p1], d0; mov p5, p4; vmul.f bmh6, x6, x1, r6
vldb wl11, [p0, #64]; vst.conv.bf16.fp32 bmh1, [p5], #64
vst wl2, [sp, #-192]; mov p6, p2 // 32-byte Folded Spill
vlda wl8, [p0, #32]; vldb wl1, [p1, #32]; mov p2, p5
vlda wl2, [p1, #64]; vldb wl3, [p0, #96]; vst.conv.bf16.fp32 bmh2, [p2], #64
vlda wl5, [p1, #96]; vldb.3d wl4, [p0], d0; vst.conv.bf16.fp32 bmh3, [p5, #32]; vmov wh5, wl0
vldb.3d wl6, [p1], d0; vst.conv.bf16.fp32 bmh6, [p4, #32]
vst.conv.bf16.fp32 bmh4, [p3, #32]; vmul.f bmh7, x3, x5, r6
nop
vst.conv.bf16.fp32 bmh5, [p6, #32]
vst wl2, [sp, #-64] // 32-byte Folded Spill
mov p3, p2; vmul.f bmh0, x7, x9, r6
vlda wl10, [sp, #-64]; vmov wl6, wl8; vmul.f bmh2, x11, x2, r6 // 32-byte Folded Reload
.L_LEnd2:
nopb ; vlda wl4, [sp, #-128]; vst.conv.bf16.fp32 bmh7, [p3], #64; nopx ; vmov wl8, wl10; vmul.f bmh1, x4, x6, r6 // 32-byte Folded Reload
```
Without:

```
vldb wl3, [p0, #32]; nopxm
vlda wl5, [p0, #96]; vldb wl6, [p1, #32]
vlda wl2, [p0, #64]; vldb wl8, [p1, #96]
vlda.3d wl10, [p0], d0; vldb wl4, [p1, #64]
vldb.3d wl1, [p1], d0
vmul.f bmh5, x2, x4, r6
vmul.f bmh3, x3, x5, r6
vmul.f bmh4, x6, x10, r6
vmul.f bmh6, x8, x1, r6
nop
vlda wl1, [p0, #32]; vldb wl10, [p1, #32]; vst.conv.bf16.fp32 bmh2, [p3, #32]
vst.conv.bf16.fp32 bmh5, [p3], #64; vmul.f bmh7, x10, x1, r6
vst.conv.bf16.fp32 bmh0, [p3, #32]
vst.conv.bf16.fp32 bmh3, [p3], #64
vst.conv.bf16.fp32 bmh1, [p2, #32]; vldb wl6, [p0, #96]; mov p2, p3
vlda wl8, [p0, #64]; vldb wl10, [p1, #96]; vst.conv.bf16.fp32 bmh6, [p2], #64; vmul.f bmh1, x3, x6, r6
vldb wl1, [p1, #64]; vlda.3d wl3, [p0], d0; vst.conv.bf16.fp32 bmh4, [p3, #32]; nopx ; mov p3, p2; vmul.f bmh2, x5, x8, r6
.L_LEnd2:
vldb.3d wl5, [p1], d0; nopa ; vst.conv.bf16.fp32 bmh7, [p3], #64; nopx ; vmov wh4, wl0; vmul.f bmh0, x1, x10, r6
```
I have experimented with different options, like allowing more than one external source. I indeed see some improvements in the whole `Mul` family of benchmarks, but also regressions for others. In the end, it really comes down to luck (meaning: how the MachinePipeliner places the instructions), so I'd rather not touch the logic until we have a more solid plan.
Those BitVectors get expensive on targets like AMDGPU with thousands of registers, and RegAliasIterator is also expensive there. We can move all liveness calculations to use RegUnits instead, which speeds things up on targets where RegAliasIterator is expensive, like AMDGPU. On targets where RegAliasIterator is cheap, this alternative can be slightly more expensive, but I believe the tradeoff is worth it.
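A rough sketch of the shape of that change (illustrative only; `Reg`, `TRI`, `PhysRegDefs`, and `RUDefs` stand in for the pass state):

```cpp
// Before: one bit per physical register, so every def walks all aliases.
for (MCRegAliasIterator AI(Reg, TRI, /*IncludeSelf=*/true); AI.isValid(); ++AI)
  PhysRegDefs.set(*AI);

// After: one bit per register unit; aliasing registers share units, so a
// single pass over the units of Reg covers all overlap queries.
for (MCRegUnitIterator RUI(Reg, TRI); RUI.isValid(); ++RUI)
  RUDefs.set(*RUI);
```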
Fix regression introduced in d4b8b72
Reverts the behavior introduced by 770393b while keeping the refactored code. Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU.
Force-pushed from 8d97183 to 0c59352.
Note: we are lagging behind upstream by a couple of months, so I cherry-picked some commits from there to minimise conflicts.
```diff
@@ -597,14 +599,6 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
     const MachineLoop *ML = MLI->getLoopFor(BB);
     if (ML && ML->getHeader()->isEHPad()) continue;

-    // Conservatively treat live-in's as an external def.
-    // FIXME: That means a reload that're reused in successor block(s) will not
```
As we removed this FIXME (and extended the implementation accordingly), I think this can be well received by the community.
```cpp
for (MCRegUnitIterator RUI(LoopLiveInReg, TRI); RUI.isValid(); ++RUI) {
  if (RUDefs.test(*RUI)) {
    RUClobbers.set(*RUI);
    LaneBitmask LiveInMask = LoopLI.LaneMask;
```
check: we filter out some cases where other aliasing reg units are live, but with disjoint lanes. Maybe a comment to clarify this would be helpful.
Exactly, we only account for the live reg units that are part of the lane mask.
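As a sketch of that filtering (an assumed shape based on this discussion, not the verbatim diff): `MCRegUnitMaskIterator` yields each unit of the live-in register together with the lane mask that unit covers, which lets us skip units whose lanes don't intersect the live-in lanes:

```cpp
for (MCRegUnitMaskIterator RUI(LoopLiveInReg, TRI); RUI.isValid(); ++RUI) {
  auto [Unit, UnitMask] = *RUI;
  // Only units whose lanes intersect the live-in lane mask can actually
  // conflict; aliases on disjoint lanes stay hoistable.
  if (RUDefs.test(Unit) && (LoopLI.LaneMask & UnitMask).any())
    RUClobbers.set(Unit);
}
```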
```cpp
               MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
      });
  const auto NumInternal = MI.getNumOperands() - 1 - (2 * NumExternal);
  return NumExternal == 1 && NumInternal >= 1;
```
Here, do we consider a subregister index of an internal value as accounting for `NumInternal`? Should we divide `NumInternal` by 2?
I'm not completely clear on what the best heuristic would be, tbh. As mentioned in #220 (comment), I have experimented a bit, but it is challenging to find something that consistently yields good results. The current code works pretty well (see results in the PR description), and I'm a bit afraid of specialising the heuristic too much for our current benchmarks if we keep tweaking it. I would suggest leaving that basic heuristic intact, and tweaking it in follow-up work as we see fit. What do you think?
As this is a heuristic, we can leave it as is, since it already gives us a nice improvement. I also tried some experiments during the review, and I think it is good as is (I also know it is hard to tune).
This PR extends MachineLICM in a very clever way. I left some minor comments, mostly for clarification.
Both are based on `MachineLICMBase`, and the functionality there is "switched" based on a `PreRegAlloc` flag. This commit is simply about trusting the original value of that flag, instead of overwriting it based on `MRI.isSSA()`, which is unreliable.
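A sketch of what "trusting the original value" could look like; the constructor signature here is my assumption for illustration, not necessarily how the commit wires it:

```cpp
// Each pass fixes the flag once at construction; runOnMachineFunction no
// longer overwrites it from MRI.isSSA().
struct EarlyMachineLICM : MachineLICMBase {
  EarlyMachineLICM() : MachineLICMBase(/*PreRegAlloc=*/true) {}
};
struct MachineLICM : MachineLICMBase {
  MachineLICM() : MachineLICMBase(/*PreRegAlloc=*/false) {}
};
```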
Force-pushed from d0ab739 to 180a1b7.
llvm/lib/CodeGen/MachineLICM.cpp
Outdated
```diff
@@ -614,6 +608,16 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
       ProcessMI(&MI, RUDefs, RUClobbers, StoredFIs, Candidates, CurLoop);
     }

+  // Mark registers as clobbered if they are defined in the loop and also livein
```
nit: I would describe this code as: "if they are livein and also defined in the loop".
# "tie" the sources together. | ||
|
||
# We expect the REG_SEQUENCE for the load of %ir.in1 to be in the second stage, close to | ||
# its VMUL consumer. |
Perhaps say explicitly that you don't want the VMUL in the steady state to read a PHI node, but rather the REG_SEQUENCE ("close to" in a SWP schedule is a bit ambiguous).
```cpp
  return NumExternal == 1 && NumInternal >= 1;
};
if (OnlyLocalSources && HasExternalAndLocalSources(MI))
  MoveLatToSuccessors = false;
```
I think we can const-initialize `MoveLatToSuccessors` here.
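A minimal sketch of that suggestion, assuming the flag previously defaulted to `true` before this check:

```cpp
// Fold the condition into the initialization instead of mutating the flag.
const bool MoveLatToSuccessors =
    !(OnlyLocalSources && HasExternalAndLocalSources(MI));
```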
This allows hoisting instructions that use registers which are not redefined in the loop. Previously, MachineLICM basically could not hoist any instruction with register inputs.
This adds an MIR test specifically for the MachinePipeliner, and updates the existing Mul2D end-to-end test to actually use SWP.
This is now very careful about REG_SEQUENCEs that have an external source. That source is likely to create a COPY during regalloc, and we need to ensure that this copy can later be hoisted by LICM. See tests :)
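To illustrate the hoisting rule described above (a simplified sketch inferred from this description, not the PR's exact code; the real pass also tracks defs, stored frame indices, implicit operands, etc.), an instruction only remains a candidate if none of the reg units it reads are clobbered inside the loop:

```cpp
// Hypothetical helper: true if every physical register the instruction
// reads is loop-invariant, i.e. none of its reg units are redefined
// ("clobbered") anywhere in the loop.
static bool hasLoopInvariantInputs(const MachineInstr &MI,
                                   const BitVector &RUClobbers,
                                   const TargetRegisterInfo *TRI) {
  for (const MachineOperand &MO : MI.uses()) {
    if (!MO.isReg() || !MO.getReg() || !MO.getReg().isPhysical())
      continue;
    for (MCRegUnitIterator RUI(MO.getReg().asMCReg(), TRI); RUI.isValid();
         ++RUI)
      if (RUClobbers.test(*RUI))
        return false; // This input is redefined in the loop: not hoistable.
  }
  return true;
}
```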
Force-pushed from 180a1b7 to da8d3d6.
This mostly extends the existing post-RA LICM pass so that it actually does something about instructions with register inputs. I'll see if I can upstream those changes.
Then there is a DAGMutator change to give more opportunities to `MachineLICM`.
Best reviewed commit by commit.
I'll check the 30% regression in `ReLu_bfloat16` in more detail (it comes from extra spills). But even in this state, the QoR is good.