MachineLICM to hoist instructions with constant inputs #220

Merged
merged 12 commits into from
Oct 29, 2024

Collaborator

@gbossu gbossu commented Oct 23, 2024

This mostly extends the existing post-RA LICM pass so that it actually does something about instructions with register inputs. I'll see if I can upstream those changes.

Then there is a DAGMutator change to give MachineLICM more hoisting opportunities.

Best reviewed commit by commit.

| Core_Compute_Cycle_Count   | bfloat16      | Mul2d_bf16_0 | Scale_Add_0  | Scale_Add_1  | Mul2d_bf16_1 | InstanceNormPart1_aie2_bf16_0 | BatchNorm1d_aie2_bfloat16 | BatchNorm2D_1 | LayerNormC8Part1_aie2_bf16_0 | Conv2D_ReLU_int8_1 | int8         | BatchNorm2D_0 | Tanh_0       | BatchNorm1d_aie2_int8 | Tanh_1       | ThresholdedRelu_aie2_int8 | Add2D_1      | Sin_aie2_bf16 | Conv2D_ReLU_int8_0 | Softmax_1    | Elu_aie2_int8_0 | Conv2D_DW_bf16_0 | InstanceNormPart2_aie2_bf16_0 | ReduceMeanAxis_1_aie2_bf16 | ReduceMeanAxis_4_aie2_bf16 | Rsqrt_aie2_int8_0 | ReduceMeanAxis_2_aie2_bf16 | DilatedConv2D_1 | SigmoidTemplated_int8_0 | SigmoidTemplated_int8_1 | HardswishAsHardsigmoid_aie2_0 | Hardswish_aie2_0 | Sub_aie2_int8_0 | Sub_aie2_int8_0_ptr_interface | ReduceMeanAxis_5_aie2_bf16 | ReduceMeanAxis_6_aie2_bf16 | ReduceMeanAxis_3_aie2_bf16 | Add_aie2_0   | SubBroadcasting_aie2_int8_0 | SubBroadcasting_aie2_int8_0_ptr_interface | AddBroadcasting_aie2_0 | ReduceSumAxis_1_aie2_int8 | AddAttributeBroadcasting_aie2_int8 | SubAttributeBroadcasting_aie2_int8_0 | Sin_aie2_int8 | Conv2D_DW_1  | Conv2D_SV60  | Conv2D_FC_0  | GEMM_bf16_1  | Conv2D_0     |       | AvgPool2dVariant_aie2_bf16_1 | Conv2D_1     | ReduceProdAxis_4_aie2_bf16 | ReduceProdAxis_1_aie2_bf16 | ReduceProdAxis_2_aie2_bf16 | Mul2D_0      | Mul2D_1      | HardswishAsHardsigmoid_aie2_1 | Hardswish_aie2_1 | Erf_aie2_bf16_0 | ReduceProdAxis_5_aie2_bf16 | ReduceProdAxis_6_aie2_bf16 | ReduceProdAxis_3_aie2_bf16 | ReduceProdAxis_7_aie2_bf16 | TanhTemplated_aie2_bfloat16 | MulAttributeBroadcasting_aie2_int8_0 | SigmoidTemplated_bf16_0 | GELU_0        | MulBroadcasting_aie2_0 | GELU_1        | SiLU_aie2_bf16 | Mul_aie2_0    | HardSigmoid_bf16_1 | HardSigmoid_bf16_0 | MulBroadcastingBf16_aie2_0 | MulBf16_aie2_0 | MulAttributeBroadcasting_aie2_bf16_0 | Average diff |
| -------------------------- | ------------- | ------------ | ------------ | ------------ | ------------ | ----------------------------- | ------------------------- | ------------- | ---------------------------- | ------------------ | ------------ | ------------- | ------------ | --------------------- | ------------ | ------------------------- | ------------ | ------------- | ------------------ | ------------ | --------------- | ---------------- | ----------------------------- | -------------------------- | -------------------------- | ----------------- | -------------------------- | --------------- | ----------------------- | ----------------------- | ----------------------------- | ---------------- | --------------- | ----------------------------- | -------------------------- | -------------------------- | -------------------------- | ------------ | --------------------------- | ----------------------------------------- | ---------------------- | ------------------------- | ---------------------------------- | ------------------------------------ | ------------- | ------------ | ------------ | ------------ | ------------ | ------------ |       | ---------------------------- | ------------ | -------------------------- | -------------------------- | -------------------------- | ------------ | ------------ | ----------------------------- | ---------------- | --------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- | --------------------------- | ------------------------------------ | ----------------------- | ------------- | ---------------------- | ------------- | -------------- | ------------- | ------------------ | ------------------ | -------------------------- | -------------- | ------------------------------------ | ------------ |
| Baseline                   | 907(+0.00%)   | 505(+0.00%)  | 367(+0.00%)  | 367(+0.00%)  | 321(+0.00%)  | 2882(+0.00%)                  | 387(+0.00%)               | 415(+0.00%)   | 8890(+0.00%)                 | 922(+0.00%)        | 846(+0.00%)  | 306(+0.00%)   | 1964(+0.00%) | 406(+0.00%)           | 2572(+0.00%) | 865(+0.00%)               | 434(+0.00%)  | 3009(+0.00%)  | 10145(+0.00%)      | 570(+0.00%)  | 578(+0.00%)     | 1175(+0.00%)     | 9456(+0.00%)                  | 13034(+0.00%)              | 13040(+0.00%)              | 2376(+0.00%)      | 13070(+0.00%)              | 5382(+0.00%)    | 1275(+0.00%)            | 1275(+0.00%)            | 1368(+0.00%)                  | 1368(+0.00%)     | 703(+0.00%)     | 703(+0.00%)                   | 7208(+0.00%)               | 7215(+0.00%)               | 7229(+0.00%)               | 725(+0.00%)  | 753(+0.00%)                 | 753(+0.00%)                               | 775(+0.00%)            | 7235(+0.00%)              | 806(+0.00%)                        | 806(+0.00%)                          | 841(+0.00%)   | 852(+0.00%)  | 857(+0.00%)  | 2647(+0.00%) | 7661(+0.00%) | 7687(+0.00%) |  ...  | 1783(+0.00%)                 | 2458(+0.00%) | 35954(+0.00%)              | 35922(+0.00%)              | 18052(+0.00%)              | 548(+0.00%)  | 548(+0.00%)  | 1590(+0.00%)                  | 1585(+0.00%)     | 2894(+0.00%)    | 9184(+0.00%)               | 9168(+0.00%)               | 9185(+0.00%)               | 1894(+0.00%)               | 1143(+0.00%)                | 581(+0.00%)                          | 1954(+0.00%)            | 2594(+0.00%)  | 358(+0.00%)            | 3426(+0.00%)  | 3608(+0.00%)   | 295(+0.00%)   | 966(+0.00%)        | 1434(+0.00%)       | 1174(+0.00%)               | 1119(+0.00%)   | 1555(+0.00%)                         | +0.00%       |
| MachineLICM changes        | 907(+0.00%)   | 505(+0.00%)  | 371(+1.09%)  | 371(+1.09%)  | 321(+0.00%)  | 2901(+0.66%)                  | 389(+0.52%)               | 417(+0.48%)   | 8930(+0.45%)                 | 926(+0.43%)        | 849(+0.35%)  | 307(+0.33%)   | 1970(+0.31%) | 407(+0.25%)           | 2578(+0.23%) | 867(+0.23%)               | 435(+0.23%)  | 3015(+0.20%)  | 10164(+0.19%)      | 571(+0.18%)  | 579(+0.17%)     | 1177(+0.17%)     | 9472(+0.17%)                  | 13056(+0.17%)              | 13062(+0.17%)              | 2380(+0.17%)      | 13092(+0.17%)              | 5391(+0.17%)    | 1277(+0.16%)            | 1277(+0.16%)            | 1370(+0.15%)                  | 1370(+0.15%)     | 704(+0.14%)     | 704(+0.14%)                   | 7218(+0.14%)               | 7225(+0.14%)               | 7239(+0.14%)               | 726(+0.14%)  | 754(+0.13%)                 | 754(+0.13%)                               | 776(+0.13%)            | 7244(+0.12%)              | 807(+0.12%)                        | 807(+0.12%)                          | 842(+0.12%)   | 853(+0.12%)  | 858(+0.12%)  | 2650(+0.11%) | 7669(+0.10%) | 7695(+0.10%) |  ...  | 1781(-0.11%)                 | 2450(-0.33%) | 35498(-1.27%)              | 35466(-1.27%)              | 17821(-1.28%)              | 548(+0.00%)  | 548(+0.00%)  | 1590(+0.00%)                  | 1585(+0.00%)     | 2894(+0.00%)    | 8730(-4.94%)               | 8709(-5.01%)               | 8722(-5.04%)               | 1794(-5.28%)               | 1051(-8.05%)                | 517(-11.02%)                         | 1954(+0.00%)            | 2144(-17.35%) | 294(-17.88%)           | 2811(-17.95%) | 3608(+0.00%)   | 231(-21.69%)  | 649(-32.82%)       | 937(-34.66%)       | 1174(+0.00%)               | 1119(+0.00%)   | 1555(+0.00%)                         | -0.49%       |
| DAGMutator changes         | 1217(+34.18%) | 519(+2.77%)  | 374(+0.81%)  | 374(+0.81%)  | 327(+1.87%)  | 2901(+0.00%)                  | 389(+0.00%)               | 417(+0.00%)   | 8930(+0.00%)                 | 926(+0.00%)        | 849(+0.00%)  | 307(+0.00%)   | 1970(+0.00%) | 407(+0.00%)           | 2578(+0.00%) | 867(+0.00%)               | 435(+0.00%)  | 3015(+0.00%)  | 10164(+0.00%)      | 571(+0.00%)  | 579(+0.00%)     | 1177(+0.00%)     | 9472(+0.00%)                  | 13056(+0.00%)              | 13062(+0.00%)              | 2380(+0.00%)      | 13092(+0.00%)              | 5391(+0.00%)    | 1277(+0.00%)            | 1277(+0.00%)            | 1370(+0.00%)                  | 1370(+0.00%)     | 704(+0.00%)     | 704(+0.00%)                   | 7218(+0.00%)               | 7225(+0.00%)               | 7239(+0.00%)               | 726(+0.00%)  | 754(+0.00%)                 | 754(+0.00%)                               | 776(+0.00%)            | 7244(+0.00%)              | 807(+0.00%)                        | 807(+0.00%)                          | 842(+0.00%)   | 853(+0.00%)  | 858(+0.00%)  | 2650(+0.00%) | 7669(+0.00%) | 7695(+0.00%) |  ...  | 1781(+0.00%)                 | 2450(+0.00%) | 35498(+0.00%)              | 35466(+0.00%)              | 17821(+0.00%)              | 533(-2.74%)  | 533(-2.74%)  | 1527(-3.96%)                  | 1522(-3.97%)     | 2770(-4.28%)    | 8730(+0.00%)               | 8709(+0.00%)               | 8722(+0.00%)               | 1794(+0.00%)               | 1050(-0.10%)                | 517(+0.00%)                          | 1633(-16.43%)           | 2144(+0.00%)  | 294(+0.00%)            | 2811(+0.00%)  | 2908(-19.40%)  | 231(+0.00%)   | 649(+0.00%)        | 937(+0.00%)        | 752(-35.95%)               | 697(-37.71%)   | 893(-42.57%)                         | -0.37%       |
| Total diff                 | REGR(+34.18%) | REGR(+2.77%) | REGR(+1.91%) | REGR(+1.91%) | REGR(+1.87%) | REGR(+0.66%)                  | REGR(+0.52%)              | REGR(+0.48%)  | REGR(+0.45%)                 | REGR(+0.43%)       | REGR(+0.35%) | REGR(+0.33%)  | REGR(+0.31%) | REGR(+0.25%)          | REGR(+0.23%) | REGR(+0.23%)              | REGR(+0.23%) | REGR(+0.20%)  | REGR(+0.19%)       | REGR(+0.18%) | REGR(+0.17%)    | REGR(+0.17%)     | REGR(+0.17%)                  | REGR(+0.17%)               | REGR(+0.17%)               | REGR(+0.17%)      | REGR(+0.17%)               | REGR(+0.17%)    | REGR(+0.16%)            | REGR(+0.16%)            | REGR(+0.15%)                  | REGR(+0.15%)     | REGR(+0.14%)    | REGR(+0.14%)                  | REGR(+0.14%)               | REGR(+0.14%)               | REGR(+0.14%)               | REGR(+0.14%) | REGR(+0.13%)                | REGR(+0.13%)                              | REGR(+0.13%)           | REGR(+0.12%)              | REGR(+0.12%)                       | REGR(+0.12%)                         | REGR(+0.12%)  | REGR(+0.12%) | REGR(+0.12%) | REGR(+0.11%) | REGR(+0.10%) | REGR(+0.10%) |       | IMPR(-0.11%)                 | IMPR(-0.33%) | IMPR(-1.27%)               | IMPR(-1.27%)               | IMPR(-1.28%)               | IMPR(-2.74%) | IMPR(-2.74%) | IMPR(-3.96%)                  | IMPR(-3.97%)     | IMPR(-4.28%)    | IMPR(-4.94%)               | IMPR(-5.01%)               | IMPR(-5.04%)               | IMPR(-5.28%)               | IMPR(-8.14%)                | IMPR(-11.02%)                        | IMPR(-16.43%)           | IMPR(-17.35%) | IMPR(-17.88%)          | IMPR(-17.95%) | IMPR(-19.40%)  | IMPR(-21.69%) | IMPR(-32.82%)      | IMPR(-34.66%)      | IMPR(-35.95%)              | IMPR(-37.71%)  | IMPR(-42.57%)                        | -0.87%       |

I'll check the ~30% regression in ReLu_bfloat16 in more detail (it comes from extra spills). But even in this state, the QoR is good.

@@ -356,7 +356,8 @@ bool MachineLICMBase::runOnMachineFunction(MachineFunction &MF) {
   MRI = &MF.getRegInfo();
   SchedModel.init(&ST);

-  PreRegAlloc = MRI->isSSA();
+  PreRegAlloc = !MF.getProperties().hasProperty(
+      MachineFunctionProperties::Property::NoVRegs);
Collaborator Author

Tbh one should never redefine PreRegAlloc; there's a reason why there are separate MachineLICM and EarlyMachineLICM passes. But the distinction doesn't really matter because PreRegAlloc is redefined anyway.

This diff is more of a band-aid to make MIR tests easy to write, because the MIRParser considers MIR as SSA if it has absolutely no vreg, which is unfortunate.
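To illustrate why the two predicates disagree on such tests, here is a minimal toy model in plain C++ (not LLVM code). `ToyFunction` and both helper functions are hypothetical names invented for this sketch; the only behaviour taken from the comment above is that a function without any vreg is treated as SSA.

```cpp
// Toy stand-in for a MachineFunction; all names here are hypothetical.
struct ToyFunction {
  unsigned NumVRegs;      // number of virtual registers in the function
  bool AnyVRegRedefined;  // true if some vreg has more than one def
};

// Mirrors the behaviour described above: with no vregs at all,
// nothing can violate SSA, so the function counts as SSA.
bool looksSSA(const ToyFunction &F) { return !F.AnyVRegRedefined; }

// Old predicate from the diff: derive PreRegAlloc from the SSA flag.
bool preRegAllocFromSSA(const ToyFunction &F) { return looksSSA(F); }

// New predicate from the diff: derive PreRegAlloc from "has vregs".
bool preRegAllocFromNoVRegs(const ToyFunction &F) { return F.NumVRegs != 0; }
```

For a post-RA MIR test that contains only physical registers, the first predicate reports "pre-RA" while the second reports "post-RA", which is the mismatch the band-aid works around.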

@gbossu gbossu force-pushed the gaetan.licm.constant.regs branch from 21d4886 to 8d97183 Compare October 23, 2024 15:51
@@ -381,6 +383,26 @@ class PropagateIncomingLatencies : public ScheduleDAGMutation {
}))
continue;

// Do not change the latency if the REG_SEQUENCE has one source
Collaborator

Hmm, this is exactly the situation I had in mind when I looked at MacroFusion: placing the REG_SEQUENCE near its user. Great!

auto HasExternalAndLocalSources = [&MBB, &MRI](const MachineInstr &MI) {
return MI.isRegSequence() && MRI.isSSA() && MI.getNumOperands() > 3 &&
count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
return MO.isReg() && MO.getReg().isVirtual() &&
Collaborator

check: as we eliminate REG_SEQUENCEs as part of the de-ssa process, do we really need to check MRI.isSSA() and MO.getReg().isVirtual()?

Collaborator Author

Probably not, I'm just careful because that DAGMutator could potentially be run at any moment, and in the middle of the "de-ssa process", we might still have reg_sequence. But you're right, I'm probably way too cautious here :D

count_if(MI.uses(), [&MBB, &MRI](const MachineOperand &MO) {
return MO.isReg() && MO.getReg().isVirtual() &&
MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
}) == 1;
Collaborator

For some benchmarks, we can have something like this:

 %529:eds = REG_SEQUENCE %47, %subreg.sub_mod, %50, %subreg.sub_dim_size, %53, %subreg.sub_dim_stride, %581, %subreg.sub_dim_count, %56, %subreg.sub_hi_dim_then_sub_dim_size, %59, %subreg.sub_hi_dim_then_sub_dim_stride, %582, %subreg.sub_hi_dim_then_sub_dim_count
 %530:acc512, %531:ep, %532:edc, %533:edc = VLDA_3D_CONV_FP32_BF16 %579, %529 :: (load unknown-size from %ir.in_ptr2.0, align 32, !tbaa !4, !noalias !1655, addrspace 6)

The first 3 registers come from outside so this comparison == 1 will fail. However, for Add2D_bf16_1 this is a nice thing considering the final result.

In the case of Mul2d_bf16_0, this mutation has the opposite effect: the REG_SEQUENCE outputs become LC deps. If we instead disable the mutation entirely, only the lanes are LC deps, and MachineLICM can hoist them nicely.

With the mutation we have:

	nopb	;		vlda	wl2, [sp, #-192];		nops	;		nopxm	;		nopv	 // 32-byte Folded Reload
	vldb	wl2, [p0, #96]
	vmov	wh2, wl0
	vlda	wl7, [p0, #64];		vldb	wl10, [p1, #32]
	vst	wh2, [sp, #-96]                 // 32-byte Folded Spill
	vlda	wl9, [p1, #64];		vldb	wl2, [p0, #32]
	vst	wh2, [sp, #-160];		vmov	wh4, wl0 // 32-byte Folded Spill
	vst	wh2, [sp, #-32];		vmov	wh10, wl0;		vmul.f	bmh5, x2, x8, r6 // 32-byte Folded Spill
	vst	wl2, [sp, #-128];		mov	p4, p3;		vmul.f	bmh3, x3, x5, r6 // 32-byte Folded Spill
	vlda.3d	wl3, [p0], d0;		vldb	wl2, [p1, #96];		vst.conv.bf16.fp32	bmh0, [p4], #64;		vmul.f	bmh4, x4, x10, r6
	vldb.3d	wl5, [p1], d0;		mov	p5, p4;		vmul.f	bmh6, x6, x1, r6
	vldb	wl11, [p0, #64];		vst.conv.bf16.fp32	bmh1, [p5], #64
	vst	wl2, [sp, #-192];		mov	p6, p2 // 32-byte Folded Spill
	vlda	wl8, [p0, #32];		vldb	wl1, [p1, #32];		mov	p2, p5
	vlda	wl2, [p1, #64];		vldb	wl3, [p0, #96];		vst.conv.bf16.fp32	bmh2, [p2], #64
	vlda	wl5, [p1, #96];		vldb.3d	wl4, [p0], d0;		vst.conv.bf16.fp32	bmh3, [p5, #32];		vmov	wh5, wl0
	vldb.3d	wl6, [p1], d0;		vst.conv.bf16.fp32	bmh6, [p4, #32]
	vst.conv.bf16.fp32	bmh4, [p3, #32];		vmul.f	bmh7, x3, x5, r6
	nop	
	vst.conv.bf16.fp32	bmh5, [p6, #32]
	vst	wl2, [sp, #-64]                 // 32-byte Folded Spill
	mov	p3, p2;		vmul.f	bmh0, x7, x9, r6
	vlda	wl10, [sp, #-64];		vmov	wl6, wl8;		vmul.f	bmh2, x11, x2, r6 // 32-byte Folded Reload
.L_LEnd2:
	nopb	;		vlda	wl4, [sp, #-128];		vst.conv.bf16.fp32	bmh7, [p3], #64;		nopx	;		vmov	wl8, wl10;		vmul.f	bmh1, x4, x6, r6 // 32-byte Folded Reload

Without:

	vldb	wl3, [p0, #32];		nopxm	
	vlda	wl5, [p0, #96];		vldb	wl6, [p1, #32]
	vlda	wl2, [p0, #64];		vldb	wl8, [p1, #96]
	vlda.3d	wl10, [p0], d0;		vldb	wl4, [p1, #64]
	vldb.3d	wl1, [p1], d0
	vmul.f	bmh5, x2, x4, r6
	vmul.f	bmh3, x3, x5, r6
	vmul.f	bmh4, x6, x10, r6
	vmul.f	bmh6, x8, x1, r6
	nop	
	vlda	wl1, [p0, #32];		vldb	wl10, [p1, #32];		vst.conv.bf16.fp32	bmh2, [p3, #32]
	vst.conv.bf16.fp32	bmh5, [p3], #64;		vmul.f	bmh7, x10, x1, r6
	vst.conv.bf16.fp32	bmh0, [p3, #32]
	vst.conv.bf16.fp32	bmh3, [p3], #64
	vst.conv.bf16.fp32	bmh1, [p2, #32];		vldb	wl6, [p0, #96];		mov	p2, p3
	vlda	wl8, [p0, #64];		vldb	wl10, [p1, #96];		vst.conv.bf16.fp32	bmh6, [p2], #64;		vmul.f	bmh1, x3, x6, r6
	vldb	wl1, [p1, #64];		vlda.3d	wl3, [p0], d0;		vst.conv.bf16.fp32	bmh4, [p3, #32];		nopx	;		mov	p3, p2;		vmul.f	bmh2, x5, x8, r6
.L_LEnd2:
	vldb.3d	wl5, [p1], d0;		nopa	;		vst.conv.bf16.fp32	bmh7, [p3], #64;		nopx	;		vmov	wh4, wl0;		vmul.f	bmh0, x1, x10, r6

Collaborator Author

I have experimented with different options, like allowing more than one external source. I indeed see some improvements in the whole Mul family of benchmarks, but also regressions for others. In the end, it really seems down to luck (meaning: how the MachinePipeliner places the instructions), so I'd rather not touch the logic until we have a more solid plan.

Those BitVectors get expensive on targets like AMDGPU with thousands of
registers, and RegAliasIterator is also expensive.

We can move all liveness calculations to use RegUnits instead to speed
it up for targets where RegAliasIterator is expensive, like AMDGPU.
On targets where RegAliasIterator is cheap, this alternative can be a little more expensive, but I believe the tradeoff is worth it.
Fix regression introduced in d4b8b72
Reverts the behavior introduced by 770393b while keeping the refactored
code.

Fixes a miscompile on AArch64, at the cost of a small regression on
AMDGPU.
@gbossu gbossu force-pushed the gaetan.licm.constant.regs branch from 8d97183 to 0c59352 Compare October 24, 2024 17:33
@gbossu
Collaborator Author

gbossu commented Oct 24, 2024

Note: we are lagging behind upstream by a couple of months, so I cherry-picked some commits from there to minimise conflicts.

@@ -597,14 +599,6 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
const MachineLoop *ML = MLI->getLoopFor(BB);
if (ML && ML->getHeader()->isEHPad()) continue;

// Conservatively treat live-in's as an external def.
// FIXME: That means a reload that're reused in successor block(s) will not
Collaborator

As we removed this fixme (and extended the implementation accordingly), I think this can be well received by the community.

for (MCRegUnitIterator RUI(LoopLiveInReg, TRI); RUI.isValid(); ++RUI) {
if (RUDefs.test(*RUI)) {
RUClobbers.set(*RUI);
LaneBitmask LiveInMask = LoopLI.LaneMask;
Collaborator

check: we filter out some cases where other aliasing reg units are live, but with disjoint lanes. A clarifying comment could be helpful here.

Collaborator Author

Exactly, we only account for the live reg units that are part of the lane mask.
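The lane-mask filtering discussed in this thread can be sketched with a small stdlib-only model. This is not the LLVM implementation: `LaneMask`, `RegUnitLanes`, `LiveIn`, and `markLiveInClobbers` are hypothetical names, and plain `std::set` stands in for the reg-unit bit vectors.

```cpp
#include <cstdint>
#include <set>
#include <vector>

using LaneMask = std::uint64_t; // toy stand-in for a lane bitmask

// One register unit of a physical register, with the lanes it covers.
struct RegUnitLanes {
  unsigned Unit;
  LaneMask Lanes;
};

// A loop live-in: all units of the register plus the lanes that are live.
struct LiveIn {
  std::vector<RegUnitLanes> Units;
  LaneMask LiveLanes;
};

// Marks a unit as clobbered only if it is (a) covered by the live-in's
// lane mask and (b) defined somewhere in the loop. Units of the same
// register whose lanes are disjoint from LiveLanes are ignored.
void markLiveInClobbers(const LiveIn &LI, const std::set<unsigned> &RUDefs,
                        std::set<unsigned> &RUClobbers) {
  for (const RegUnitLanes &RU : LI.Units) {
    if ((RU.Lanes & LI.LiveLanes) == 0)
      continue; // unit not actually live on loop entry
    if (RUDefs.count(RU.Unit))
      RUClobbers.insert(RU.Unit);
  }
}
```

With a two-unit register where only the low lane is live, a definition of the high unit inside the loop does not mark anything as clobbered in this model.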

MRI.getVRegDef(MO.getReg())->getParent() != &MBB;
});
const auto NumInternal = MI.getNumOperands() - 1 - (2 * NumExternal);
return NumExternal == 1 && NumInternal >= 1;
Collaborator

Here, do we count the subregister index of an internal value towards NumInternal? Should we divide NumInternal by 2?

Collaborator Author

I'm not completely clear on what would be the best heuristic tbh. As mentioned here #220 (comment) I have experimented a bit, but it is challenging to find something that consistently yields good results. The current code works pretty well (see results in PR description), and I'm a bit afraid to specialise the heuristic too much for our current benchmarks if we keep tweaking it. I would suggest leaving that basic heuristic intact, and tweak it in follow-up work as we see fit. What do you think?

Collaborator

As this is a heuristic, we can leave it as is, since it already helps. I also ran some experiments during the review, and I think it is good as is (I also know it is hard to tune).
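To make the operand arithmetic in this thread concrete, here is a toy C++ model of the count (not LLVM code). `Operand`, `SourceCounts`, `countSources`, and `makeRegSequenceUses` are hypothetical names. The sketch assumes a REG_SEQUENCE has one def followed by (register, subregister index) pairs, and that an "external" source is a register defined outside the current block.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a use operand of a REG_SEQUENCE.
struct Operand {
  bool IsReg;          // false for subregister-index immediates
  bool DefinedInBlock; // only meaningful when IsReg is true
};

struct SourceCounts {
  std::size_t NumExternal;
  std::size_t NumInternalOperands; // as in the diff: registers + subreg indices
  std::size_t NumInternalSources;  // the "divide by 2" variant from the review
};

// Uses holds the use operands only; the single def is the "+ 1" / "- 1".
SourceCounts countSources(const std::vector<Operand> &Uses) {
  const std::size_t NumOperands = Uses.size() + 1;
  std::size_t NumExternal = 0;
  for (const Operand &MO : Uses)
    if (MO.IsReg && !MO.DefinedInBlock)
      ++NumExternal;
  const std::size_t NumInternal = NumOperands - 1 - 2 * NumExternal;
  return {NumExternal, NumInternal, NumInternal / 2};
}

// Same condition as the reviewed line.
bool hasExternalAndLocalSources(const SourceCounts &C) {
  return C.NumExternal == 1 && C.NumInternalOperands >= 1;
}

// Builds the use list of a REG_SEQUENCE with the given source counts.
std::vector<Operand> makeRegSequenceUses(std::size_t NumExt, std::size_t NumInt) {
  std::vector<Operand> Uses;
  for (std::size_t I = 0; I < NumExt; ++I) {
    Uses.push_back({true, false});  // external register
    Uses.push_back({false, false}); // its subregister index
  }
  for (std::size_t I = 0; I < NumInt; ++I) {
    Uses.push_back({true, true});   // register defined in this block
    Uses.push_back({false, false}); // its subregister index
  }
  return Uses;
}
```

For the seven-source `%529:eds` example quoted earlier, assuming three external sources and four internal ones, the diff's formula gives 15 - 1 - 6 = 8 internal operands, i.e. four internal sources. Because the condition only tests `>= 1`, dividing by two would not change the outcome in this model; the example is rejected either way since it has three external sources rather than one.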

@andcarminati
Collaborator

This PR extends MachineLICM in a very clever way. I left some minor comments, mostly for clarification.

Both passes (MachineLICM and EarlyMachineLICM) are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, instead of overwriting it
based on MRI.isSSA(), which is unreliable.
@gbossu gbossu force-pushed the gaetan.licm.constant.regs branch from d0ab739 to 180a1b7 Compare October 28, 2024 14:31
@@ -614,6 +608,16 @@ void MachineLICMBase::HoistRegionPostRA(MachineLoop *CurLoop,
ProcessMI(&MI, RUDefs, RUClobbers, StoredFIs, Candidates, CurLoop);
}

// Mark registers as clobbered if they are defined in the loop and also livein
Collaborator

nit: I would describe this code as: "if they are live-in and also defined in the loop".

# "tie" the sources together.

# We expect the REG_SEQUENCE for the load of %ir.in1 to be in the second stage, close to
# its VMUL consumer.
Collaborator

@martien-de-jong martien-de-jong Oct 29, 2024

Perhaps say explicitly that you don't want the VMUL in the steady state to read a PHI node, but rather the REG_SEQUENCE ("close to" in a SWP schedule is a bit ambiguous).

return NumExternal == 1 && NumInternal >= 1;
};
if (OnlyLocalSources && HasExternalAndLocalSources(MI))
MoveLatToSuccessors = false;
Collaborator

I think we can const-initialize MoveLatToSuccessors here.

This allows hoisting instructions that use registers which are not
redefined in the loop. Previously, MachineLICM basically could not hoist
any instruction using register inputs.

This adds an MIR test specifically for the MachinePipeliner, and
updates the existing Mul2D end-to-end test to actually use SWP.

This is now very careful about REG_SEQUENCEs that have an external
source. That source is likely to create a COPY during regalloc, and we
need to ensure that the copy can later be hoisted by LICM.

See tests :)
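The hoisting rule described in the first of these commit messages can be sketched as a toy model in plain C++. This is a simplification under stated assumptions, not the MachineLICM implementation: `Instr` and `findHoistable` are hypothetical names, registers are modelled as plain register-unit numbers, and side effects, memory accesses, and register pressure are ignored.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Toy instruction: register units it writes and reads.
struct Instr {
  std::vector<unsigned> Defs;
  std::vector<unsigned> Uses;
};

// Returns the indices of instructions that this model considers hoistable:
// every def is written exactly once in the loop and is not live-in, and
// every use is a register that is never defined inside the loop.
std::vector<std::size_t> findHoistable(const std::vector<Instr> &Loop,
                                       const std::set<unsigned> &LiveIns) {
  std::map<unsigned, unsigned> DefCount;
  for (const Instr &I : Loop)
    for (unsigned D : I.Defs)
      ++DefCount[D];

  std::set<unsigned> RUDefs, RUClobbers;
  for (const auto &Entry : DefCount) {
    RUDefs.insert(Entry.first);
    // Clobbered: defined more than once, or live-in and also defined.
    if (Entry.second > 1 || LiveIns.count(Entry.first))
      RUClobbers.insert(Entry.first);
  }

  std::vector<std::size_t> Hoistable;
  for (std::size_t Idx = 0; Idx < Loop.size(); ++Idx) {
    bool Safe = true;
    for (unsigned D : Loop[Idx].Defs)
      Safe = Safe && !RUClobbers.count(D);
    for (unsigned U : Loop[Idx].Uses)
      Safe = Safe && !RUDefs.count(U); // input must be constant in the loop
    if (Safe)
      Hoistable.push_back(Idx);
  }
  return Hoistable;
}
```

In this model, an instruction reading a live-in register that the loop never writes is hoistable, while an instruction reading a value produced inside the loop is not, even if its producer could itself be hoisted. The model does a single pass and does not iterate.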
@gbossu gbossu merged commit c7cf050 into aie-public Oct 29, 2024
8 checks passed
@gbossu gbossu deleted the gaetan.licm.constant.regs branch October 29, 2024 13:15
4 participants