MachineLICM to hoist instructions with constant inputs #220
Conversation
llvm/lib/CodeGen/MachineLICM.cpp
Outdated
```diff
@@ -356,7 +356,8 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
   MRI = &MF.getRegInfo();
   SchedModel.init(&ST);

-  PreRegAlloc = MRI->isSSA();
+  PreRegAlloc = !MF.getProperties().hasProperty(
+      MachineFunctionProperties::Property::NoVRegs);
```
Tbh one should never redefine `PreRegAlloc`; there's a reason why there is a `MachineLICM` and an `EarlyMachineLICM` pass. But they don't really matter, because `PreRegAlloc` is redefined anyway.
This diff is more of a band-aid to make MIR tests easy to write, because the MIRParser considers MIR as SSA if it has absolutely no vreg, which is unfortunate.
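To make that pitfall concrete, here is a minimal sketch; the helper name and the exact predicate are my illustration of the vacuous-truth problem, not the actual MIRParser code:

```cpp
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// If SSA-ness is inferred as "every virtual register has at most one
// definition", the predicate is vacuously true for fully
// register-allocated MIR, which has no vregs at all.
static bool looksLikeSSA(const MachineRegisterInfo &MRI) {
  for (unsigned I = 0, E = MRI.getNumVirtRegs(); I != E; ++I) {
    Register Reg = Register::index2VirtReg(I);
    // A vreg with more than one definition breaks SSA form.
    if (!MRI.def_empty(Reg) && !MRI.hasOneDef(Reg))
      return false;
  }
  return true; // Zero vregs: trivially "SSA".
}
```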
Force-pushed from 21d4886 to 8d97183.
```diff
@@ -381,6 +383,26 @@ class PropagateIncomingLatencies : public ScheduleDAGMutation {
         }))
       continue;

+    // Do not change the latency if the REG_SEQUENCE has one source
```
Humm, this is exactly the situation I had in mind when I looked at `MacroFusion`: place the `REG_SEQUENCE` near its user. Great!
```cpp
auto HasExternalAndLocalSources = [&MBB, &MRI](const MachineInstr &MI) {
  return MI.isRegSequence() && MRI.isSSA() && MI.getNumOperands() > 3 &&
         count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
           return MO.isReg() && MO.getReg().isVirtual() &&
```
check: as we eliminate REG_SEQUENCEs as part of the de-ssa process, do we really need to check `MRI.isSSA()` and `MO.getReg().isVirtual()`?
Probably not, I'm just careful because that DAGMutator could potentially be run at any moment, and in the middle of the "de-ssa process" we might still have REG_SEQUENCEs. But you're right, I'm probably way too cautious here :D
```cpp
count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
  return MO.isReg() && MO.getReg().isVirtual() &&
         MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
}) == 1;
```
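For readers piecing the snippets together, here is a consolidated sketch of the whole predicate as it appears across this review (an approximation combining the fragments quoted above and below, not the verbatim diff):

```cpp
// REG_SEQUENCE operands: one def, then (value, subreg-index) pairs, so
// getNumOperands() == 1 + 2 * NumSources.
auto HasExternalAndLocalSources = [&MBB, &MRI](const MachineInstr &MI) {
  if (!MI.isRegSequence() || !MRI.isSSA() || MI.getNumOperands() <= 3)
    return false;
  // Count source values whose defining instruction lives outside this block.
  const auto NumExternal =
      count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
        return MO.isReg() && MO.getReg().isVirtual() &&
               MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
      });
  // Everything else (values plus subreg indices) counts as "internal";
  // this is what the NumInternal question below is about.
  const auto NumInternal = MI.getNumOperands() - 1 - (2 * NumExternal);
  return NumExternal == 1 && NumInternal >= 1;
};
```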
For some benchmarks, we can have something like this:

```
%529:eds = REG_SEQUENCE %47, %subreg.sub_mod, %50, %subreg.sub_dim_size, %53, %subreg.sub_dim_stride, %581, %subreg.sub_dim_count, %56, %subreg.sub_hi_dim_then_sub_dim_size, %59, %subreg.sub_hi_dim_then_sub_dim_stride, %582, %subreg.sub_hi_dim_then_sub_dim_count
%530:acc512, %531:ep, %532:edc, %533:edc = VLDA_3D_CONV_FP32_BF16 %579, %529 :: (load unknown-size from %ir.in_ptr2.0, align 32, !tbaa !4, !noalias !1655, addrspace 6)
```

The first 3 registers come from outside, so this comparison `== 1` will fail. However, for `Add2D_bf16_1` this is a nice thing considering the final result.
In the case of `Mul2d_b16_0`, this mutation leads to the opposite effect: the REG_SEQUENCE outputs end up as loop-carried (LC) deps. If we, on the other hand, disable this mutation as a whole, we just have the lanes as LC deps and the MLICM can nicely hoist them.
With the mutation we have:

```
nopb ; vlda wl2, [sp, #-192]; nops ; nopxm ; nopv // 32-byte Folded Reload
vldb wl2, [p0, #96]
vmov wh2, wl0
vlda wl7, [p0, #64]; vldb wl10, [p1, #32]
vst wh2, [sp, #-96] // 32-byte Folded Spill
vlda wl9, [p1, #64]; vldb wl2, [p0, #32]
vst wh2, [sp, #-160]; vmov wh4, wl0 // 32-byte Folded Spill
vst wh2, [sp, #-32]; vmov wh10, wl0; vmul.f bmh5, x2, x8, r6 // 32-byte Folded Spill
vst wl2, [sp, #-128]; mov p4, p3; vmul.f bmh3, x3, x5, r6 // 32-byte Folded Spill
vlda.3d wl3, [p0], d0; vldb wl2, [p1, #96]; vst.conv.bf16.fp32 bmh0, [p4], #64; vmul.f bmh4, x4, x10, r6
vldb.3d wl5, [p1], d0; mov p5, p4; vmul.f bmh6, x6, x1, r6
vldb wl11, [p0, #64]; vst.conv.bf16.fp32 bmh1, [p5], #64
vst wl2, [sp, #-192]; mov p6, p2 // 32-byte Folded Spill
vlda wl8, [p0, #32]; vldb wl1, [p1, #32]; mov p2, p5
vlda wl2, [p1, #64]; vldb wl3, [p0, #96]; vst.conv.bf16.fp32 bmh2, [p2], #64
vlda wl5, [p1, #96]; vldb.3d wl4, [p0], d0; vst.conv.bf16.fp32 bmh3, [p5, #32]; vmov wh5, wl0
vldb.3d wl6, [p1], d0; vst.conv.bf16.fp32 bmh6, [p4, #32]
vst.conv.bf16.fp32 bmh4, [p3, #32]; vmul.f bmh7, x3, x5, r6
nop
vst.conv.bf16.fp32 bmh5, [p6, #32]
vst wl2, [sp, #-64] // 32-byte Folded Spill
mov p3, p2; vmul.f bmh0, x7, x9, r6
vlda wl10, [sp, #-64]; vmov wl6, wl8; vmul.f bmh2, x11, x2, r6 // 32-byte Folded Reload
.L_LEnd2:
nopb ; vlda wl4, [sp, #-128]; vst.conv.bf16.fp32 bmh7, [p3], #64; nopx ; vmov wl8, wl10; vmul.f bmh1, x4, x6, r6 // 32-byte Folded Reload
```
Without:

```
vldb wl3, [p0, #32]; nopxm
vlda wl5, [p0, #96]; vldb wl6, [p1, #32]
vlda wl2, [p0, #64]; vldb wl8, [p1, #96]
vlda.3d wl10, [p0], d0; vldb wl4, [p1, #64]
vldb.3d wl1, [p1], d0
vmul.f bmh5, x2, x4, r6
vmul.f bmh3, x3, x5, r6
vmul.f bmh4, x6, x10, r6
vmul.f bmh6, x8, x1, r6
nop
vlda wl1, [p0, #32]; vldb wl10, [p1, #32]; vst.conv.bf16.fp32 bmh2, [p3, #32]
vst.conv.bf16.fp32 bmh5, [p3], #64; vmul.f bmh7, x10, x1, r6
vst.conv.bf16.fp32 bmh0, [p3, #32]
vst.conv.bf16.fp32 bmh3, [p3], #64
vst.conv.bf16.fp32 bmh1, [p2, #32]; vldb wl6, [p0, #96]; mov p2, p3
vlda wl8, [p0, #64]; vldb wl10, [p1, #96]; vst.conv.bf16.fp32 bmh6, [p2], #64; vmul.f bmh1, x3, x6, r6
vldb wl1, [p1, #64]; vlda.3d wl3, [p0], d0; vst.conv.bf16.fp32 bmh4, [p3, #32]; nopx ; mov p3, p2; vmul.f bmh2, x5, x8, r6
.L_LEnd2:
vldb.3d wl5, [p1], d0; nopa ; vst.conv.bf16.fp32 bmh7, [p3], #64; nopx ; vmov wh4, wl0; vmul.f bmh0, x1, x10, r6
```
I have experimented with different options, like allowing more than one external source. I indeed see some improvements in the whole `Mul` family of benchmarks, but also regressions for others. In the end, it really comes down to luck (meaning: how the MachinePipeliner places the instructions), so I'd rather not touch the logic until we have a more solid plan.
Those BitVectors get expensive on targets like AMDGPU with thousands of registers, and RegAliasIterator is also expensive there. We can move all liveness calculations to use RegUnits instead, which speeds things up on targets where RegAliasIterator is expensive, like AMDGPU. On targets where RegAliasIterator is cheap, this alternative can be slightly more expensive, but I believe the tradeoff is worth it.
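A rough sketch of the shape of that change (illustrative only; `Reg`, `TRI`, `PhysRegDefs`, and `RUDefs` stand in for the pass state):

```cpp
// Before: one bit per physical register, so every def walks all aliases.
for (MCRegAliasIterator AI(Reg, TRI, /*IncludeSelf=*/true); AI.isValid(); ++AI)
  PhysRegDefs.set(*AI);

// After: one bit per register unit; aliasing registers share units, so a
// single pass over the units of Reg covers all overlap queries.
for (MCRegUnitIterator RUI(Reg, TRI); RUI.isValid(); ++RUI)
  RUDefs.set(*RUI);
```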
Fix regression introduced in d4b8b72
Reverts the behavior introduced by 770393b while keeping the refactored code. Fixes a miscompile on AArch64, at the cost of a small regression on AMDGPU.
Force-pushed from 8d97183 to 0c59352.
Note: we are lagging behind upstream by a couple of months, so I cherry-picked some commits from there to minimise conflicts.
```diff
@@ -597,14 +599,6 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
     const MachineLoop *ML = MLI->getLoopFor(BB);
     if (ML && ML->getHeader()->isEHPad()) continue;

-    // Conservatively treat live-in's as an external def.
-    // FIXME: That means a reload that're reused in successor block(s) will not
```
As we removed this FIXME (and extended the implementation accordingly), I think this can be well received by the community.
```cpp
for (MCRegUnitIterator RUI(LoopLiveInReg, TRI); RUI.isValid(); ++RUI) {
  if (RUDefs.test(*RUI)) {
    RUClobbers.set(*RUI);
    LaneBitmask LiveInMask = LoopLI.LaneMask;
```
check: we filter out some cases where other aliasing reg units are live, but with disjoint lanes. Maybe a comment to clarify this would be helpful.
Exactly, we only account for the live reg units that are part of the lane mask.
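As a sketch of that filtering (an assumed shape based on this discussion, not the verbatim diff): `MCRegUnitMaskIterator` yields each unit of the live-in register together with the lane mask that unit covers, which lets us skip units whose lanes don't intersect the live-in lanes:

```cpp
for (MCRegUnitMaskIterator RUI(LoopLiveInReg, TRI); RUI.isValid(); ++RUI) {
  auto [Unit, UnitMask] = *RUI;
  // Only units whose lanes intersect the live-in lane mask can actually
  // conflict; aliases on disjoint lanes stay hoistable.
  if (RUDefs.test(Unit) && (LoopLI.LaneMask & UnitMask).any())
    RUClobbers.set(Unit);
}
```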
```cpp
               MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
      });
  const auto NumInternal = MI.getNumOperands() - 1 - (2 * NumExternal);
  return NumExternal == 1 && NumInternal >= 1;
```
Here, do we consider a subregister index of an internal value as accounting for `NumInternal`? Should we divide `NumInternal` by 2?
I'm not completely clear on what the best heuristic would be, tbh. As mentioned in #220 (comment), I have experimented a bit, but it is challenging to find something that consistently yields good results. The current code works pretty well (see results in the PR description), and I'm a bit afraid of specialising the heuristic too much for our current benchmarks if we keep tweaking it. I would suggest leaving that basic heuristic intact, and tweaking it in follow-up work as we see fit. What do you think?
As this is a heuristic, we can leave it as is, since it already gives us a nice improvement. I also tried some experiments during the review, and I think it is good as is (I also know it is hard to tune).
This PR extends MachineLICM in a very clever way. I left some minor comments, mostly for clarification.
Both are based on `MachineLICMBase`, and the functionality there is "switched" based on a `PreRegAlloc` flag. This commit is simply about trusting the original value of that flag, instead of overwriting it based on `MRI.isSSA()`, which is unreliable.
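A sketch of what "trusting the original value" could look like; the constructor signature here is my assumption for illustration, not necessarily how the commit wires it:

```cpp
// Each pass fixes the flag once at construction; runOnMachineFunction no
// longer overwrites it from MRI.isSSA().
struct EarlyMachineLICM : MachineLICMBase {
  EarlyMachineLICM() : MachineLICMBase(/*PreRegAlloc=*/true) {}
};
struct MachineLICM : MachineLICMBase {
  MachineLICM() : MachineLICMBase(/*PreRegAlloc=*/false) {}
};
```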
Force-pushed from d0ab739 to 180a1b7.
llvm/lib/CodeGen/MachineLICM.cpp
Outdated
```diff
@@ -614,6 +608,16 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
       ProcessMI(&MI, RUDefs, RUClobbers, StoredFIs, Candidates, CurLoop);
     }

+  // Mark registers as clobbered if they are defined in the loop and also livein
```
nit: I would describe this code as: "if they are livein and also defined in the loop".
# "tie" the sources together. | ||
|
||
# We expect the REG_SEQUENCE for the load of %ir.in1 to be in the second stage, close to | ||
# its VMUL consumer. |
Perhaps say explicitly that you don't want the VMUL in the steady state to read a PHI node, but rather the REG_SEQUENCE ("close to" in a SWP schedule is a bit ambiguous).
```cpp
  return NumExternal == 1 && NumInternal >= 1;
};
if (OnlyLocalSources && HasExternalAndLocalSources(MI))
  MoveLatToSuccessors = false;
```
I think we can const-initialize `MoveLatToSuccessors` here.
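A minimal sketch of that suggestion, assuming the flag previously defaulted to `true` before this check:

```cpp
// Fold the condition into the initialization instead of mutating the flag.
const bool MoveLatToSuccessors =
    !(OnlyLocalSources && HasExternalAndLocalSources(MI));
```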
This allows hoisting instructions that use registers which are not redefined in the loop. Previously, MachineLICM basically could not hoist any instruction with register inputs.
This adds an MIR test specifically for the MachinePipeliner, and updates the existing Mul2D end-to-end test to actually use SWP.
This is now very careful about REG_SEQUENCEs that have an external source. That source is likely to create a COPY during regalloc, and we need to ensure that this copy can later be hoisted by LICM. See tests :)
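To illustrate the hoisting rule described above (a simplified sketch inferred from this description, not the PR's exact code; the real pass also tracks defs, stored frame indices, implicit operands, etc.), an instruction only remains a candidate if none of the reg units it reads are clobbered inside the loop:

```cpp
// Hypothetical helper: true if every physical register the instruction
// reads is loop-invariant, i.e. none of its reg units are redefined
// ("clobbered") anywhere in the loop.
static bool hasLoopInvariantInputs(const MachineInstr &MI,
                                   const BitVector &RUClobbers,
                                   const TargetRegisterInfo *TRI) {
  for (const MachineOperand &MO : MI.uses()) {
    if (!MO.isReg() || !MO.getReg() || !MO.getReg().isPhysical())
      continue;
    for (MCRegUnitIterator RUI(MO.getReg().asMCReg(), TRI); RUI.isValid();
         ++RUI)
      if (RUClobbers.test(*RUI))
        return false; // This input is redefined in the loop: not hoistable.
  }
  return true;
}
```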
Force-pushed from 180a1b7 to da8d3d6.
This mostly extends the existing post-RA LICM pass so that it actually does something about instructions with register inputs. I'll see if I can upstream those changes.
Then there is a DAGMutator change to give more opportunities to `MachineLICM`.
Best reviewed commit by commit.
I'll check the 30% regression in `ReLu_bfloat16` in more detail (it comes from extra spills). But even in this state, the QoR is good.