Conversation

@andcarminati
Collaborator

This PR achieves optimal pipelined loops for several important kernels, but with some degradation on the other side. The next question is: how do we decide when to enable it? It should be on a per-loop basis.

if (!Reg.isVirtual())
continue;

auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
Collaborator Author

Move next to use?

IsHighLat = true;
}

if (IsHighLat) {
Collaborator Author

Early return?

Collaborator

perhaps even return <closed expression>
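
For illustration, a minimal sketch of that closed-expression form, assuming the check sits in a small helper or lambda next to the snippets above (the surrounding context is an assumption, not the actual patch):

// Sketch only: collapses the IsHighLat flag into a single returned expression.
// MI, SchedClass, I and MinRegisterLatency come from the enclosing loop.
auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
return OperandCycle.has_value() ? OperandCycle.value() >= MinRegisterLatency
                                : MI.mayLoad();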

@andcarminati
Collaborator Author

With the new cutoff commit (Core_Insn_Count chart):

Improvement: 1.11% avg.

@krishnamtibrewala
Collaborator

> With the new cutoff commit (Core_Insn_Count chart):
> Improvement: 1.11% avg.

Curious to know how the cut-off was decided, and whether it led to a reduction in perf on benchmarks where we were getting better performance without the cut-off logic.

@andcarminati
Collaborator Author

> With the new cutoff commit (Core_Insn_Count chart):
> Improvement: 1.11% avg.
>
> Curious to know how the cut-off was decided, and whether it led to a reduction in perf on benchmarks where we were getting better performance without the cut-off logic.

The original goal of this PR was Conv2D_DW_bf16, but I saw that it also affected Gemm. For that reason, I decided to keep both kernels on the heuristic's radar. This whole work, including the cutoff heuristic, was based on assembly code analysis.


// Only consider to front physical registers that also
// used by high latency operands.
if (HighLatencyRegs.count(AssignedPhysReg))
Collaborator Author

As you can see here, we reverse the MO ordering by pushing to the front. This is not by mistake; it helps a lot!

Collaborator

add this as a source code comment

Collaborator

CHECK: We push them to the front so that they get replaced first?

Collaborator

nit: Rewrite the comment to
High latency registers should be renamed first, therefore insert them at the front

Collaborator

Maybe also rewrite for clarity:

const auto InsertPoint = HighLatencyRegs.count(AssignedPhysReg) ? Candidates.begin() : Candidates.end();
Candidates.emplace(InsertPoint, &MO, AssignedPhysReg);

}
}

// This metric gives us an idea about the "demand" for high latency registers
Collaborator Author

Ratio regs/instructions.

MCRegister AssignedPhysReg = VRM->getPhys(Reg);
Candidates.emplace_back(&MO, AssignedPhysReg);

// Only consider to front physical registers that also
Collaborator

nit: to put in front registers that are also used

getLastVRegDef(const MachineBasicBlock &MBB) const;

std::set<MCRegister>
getHighOutputLatencyRegs(const MachineBasicBlock *MBB) const;
Collaborator

nit: add a comment.
Is this the total live range of a physical register, or a single Live Interval?

Collaborator

are we looking here at the physical register mapped to a single virtual register def with a high output latency?

Collaborator Author

One MCRegister can also be mapped to more than one VReg (multiple defs); it is included provided that at least one of them is high latency.
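
To illustrate the point, a sketch of how the set could be built, assuming it walks the block's defs and the VirtRegMap (the helper isHighLatencyDef is hypothetical and only keeps the example short):

std::set<MCRegister> HighLatRegisters;
for (const MachineInstr &MI : *MBB)
  for (const MachineOperand &MO : MI.defs())
    // Several virtual register defs may map to the same physical register;
    // it ends up in the set as soon as one of those defs is high latency.
    if (MO.getReg().isVirtual() && isHighLatencyDef(MI, MO)) // hypothetical helper
      HighLatRegisters.insert(VRM->getPhys(MO.getReg()));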


bool IsHighLat = false;
if (OperandCycle.has_value()) {
if (OperandCycle.value() >= MinRegisterLatency)
IsHighLat = true;
Collaborator

nit: add brackets here to differentiate from the !OperandCycle.has_value() case

if (OperandCycle.has_value()) {
if (OperandCycle.value() >= MinRegisterLatency)
IsHighLat = true;
} else if (MI.mayLoad()) {
Collaborator

nit: don't we always have IsHighLat==true when we encounter a load instruction?
I think we could simplify the conditions to:

auto SetIsHighLat = [&]() {
  if (MI.mayLoad())
    return true;

  auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
  if (!OperandCycle)
    return false;

  return OperandCycle.value() >= MinRegisterLatency;
};

Collaborator Author

Nice suggestion. I will just trust the itinerary first, before the load check, because that way we insert fewer registers into the set; otherwise the extra registers could bias the cutoff.
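
A sketch of that ordering, reusing the IsHighLatInstrOperand name that appears later in the patch and assuming the lambda captures the enclosing loop state (an approximation, not the exact patch code):

auto IsHighLatInstrOperand = [&]() {
  // Trust the itinerary first; fall back to mayLoad() only when no operand
  // cycle is modelled, so fewer registers enter the set and the cutoff
  // stays unbiased.
  auto OperandCycle = ItinData->getOperandCycle(SchedClass, I);
  if (OperandCycle.has_value())
    return OperandCycle.value() >= MinRegisterLatency;
  return MI.mayLoad();
};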


auto *ItinData = MF->getSubtarget().getInstrItineraryData();
std::set<MCRegister> HighLatRegisters;
unsigned NInstrs = 0;
Collaborator

nit: why don't we query MBB.size()?

// latency-aware heuristic.
// TODO: this should be replaced by more stable metrics related to SWP.
if (!HighLatRegisters.empty() &&
(((HighLatRegisters.size() * 100 / NInstrs)) < 250 /*calibrated value*/))
Collaborator

nit: for readability add a nice name for (HighLatRegisters.size() * 100 / NInstrs). Something like: HighLatencyRegisterInstrRatio

Collaborator
@martien-de-jong Oct 22, 2025

HighLatencyRegisterCountToInstrCountRatioInPercent
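
With such a name, the cutoff check from the snippet above would read roughly as follows (a sketch; the guard against an empty block and the elided enabled path are assumptions):

// Sketch only; the enabled path itself is elided.
const unsigned HighLatencyRegisterCountToInstrCountRatioInPercent =
    NInstrs ? HighLatRegisters.size() * 100 / NInstrs : 0;
if (!HighLatRegisters.empty() &&
    HighLatencyRegisterCountToInstrCountRatioInPercent < 250 /*calibrated value*/)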

Collaborator Author
@andcarminati Oct 24, 2025

I changed the name according to @F-Stuckmann's suggestion and extended the comment to accommodate @martien-de-jong's observation.

cl::init(3));

static cl::opt<bool>
LatencyAware("aie-realloc-latencyaware", cl::Hidden, cl::init(false),
Collaborator

nit: in the last commit set the option to true, since with the cutoff it is now profitable.
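
That is, only the default in the quoted declaration flips (a sketch; the description argument is not shown in the hunk above and is left out here as well):

static cl::opt<bool>
    LatencyAware("aie-realloc-latencyaware", cl::Hidden,
                 cl::init(true)); // remaining arguments as in the original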

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch 2 times, most recently from 65f77fc to 63bbf0d on October 24, 2025 07:25

if (!Reg.isVirtual())
continue;

auto IsHighLatInstrOperand = [&]() {
Collaborator

Out of curiosity: why is this lambda not outside of the for loops?

Collaborator Author

In this case, my goal was to have a lazily evaluated boolean where we can state the conditions very explicitly. It is defined inside the loops so we can capture everything instead of passing parameters, which keeps the if statement clean.

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 63bbf0d to 6da3b15 on October 24, 2025 12:29
; CHECK-NEXT: vlda.conv.fp32.bf16 cml4, [p0], m1
; CHECK-NEXT: vlda.conv.fp32.bf16 cml2, [p1], m0
; CHECK-NEXT: vlda.conv.fp32.bf16 cml0, [p1], m0; movx r1, #60
; CHECK-NEXT: vlda.conv.fp32.bf16 cml1, [p0], m1; add.nc lc, r0, #-3; vadd.f dm3, dm0, dm1, r1
Collaborator Author

We can reach one more stage now.

@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 6da3b15 to e213349 on October 24, 2025 13:46
@andcarminati
Collaborator Author

andcarminati commented Oct 24, 2025

Complete QoR view:

(Charts: Core_Insn_Count, Core_StackSize_absolute, Core_PMSize_absolute)

(PM increases when we increase the number of SWP stages.)

…istic

This commit adds an end-to-end test that fails to be scheduled optimally (post-SWP) because
of the actual register allocation. This regression is introduced by the latency-aware
WAWRegRewriter strategy.
Ideally, we should extend this with SWP metrics.
@andcarminati force-pushed the andreu.waw.rewriter.latency.aware branch from 401e86b to 30888fc on October 24, 2025 14:59
@F-Stuckmann
Collaborator

What happens with stack and PM?

@andcarminati
Collaborator Author

> What happens with stack and PM?

I added them to the previous comment.
