Description
I'm curious about some behavior I'm seeing from Peano using the min_iteration_count pragma.
I have a matrix-vector multiplication kernel as below, where I use that pragma to indicate that the innermost loop runs at least twice:
mv.cc:
template
void matvec_vectorized(unsigned m, unsigned k, unsigned row_offset, const bfloat16 *__restrict a, const bfloat16 *__restrict b, bfloat16 *__restrict c) {
  ::aie::set_rounding(aie::rounding_mode::conv_even);
  c += row_offset * m;
  bfloat16 *c_end = c + m;
  const bfloat16 *b_end = b + k;
  for (; c < c_end; c++) {
    aie::accum acc = aie::accum();
    // The following two pragmas enable pipelining the zero-overhead loop, but they do assume that k is at least two.
    // This assumption should hold for any useful use of this function; if k were one, this would be a simple scalar multiplication of a vector.
    _Pragma("clang loop min_iteration_count(2)")
    for (const bfloat16 *__restrict b_cur = b; b_cur < b_end; b_cur += r, a += r) {
      aie::vector a_vec = aie::load_v(a);
      aie::vector b_vec = aie::load_v(b_cur);
      acc = aie::mac(acc, a_vec, b_vec);
    }
    *c = (bfloat16)aie::reduce_add(acc.template to_vector());
  }
}
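For reference, here is a minimal scalar sketch of what the kernel above computes, written in plain C++ so it runs on a host without the AIE API. The names `matvec_reference` and `inner_trip_count` are hypothetical and not part of the original code. `float` stands in for `bfloat16`, and the matrix `a` is assumed to be stored row-major and read contiguously, mirroring the `a += r` increment in the kernel. The trip-count helper assumes the vectorized inner loop consumes `r` elements per iteration, which is the quantity the `min_iteration_count` pragma constrains.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for the kernel above (float instead of bfloat16).
// `a` is assumed to be an m x k row-major matrix that is read contiguously,
// like the kernel's `a += r` pointer increment; `b` has k elements.
std::vector<float> matvec_reference(std::size_t m, std::size_t k,
                                    const std::vector<float> &a,
                                    const std::vector<float> &b) {
  assert(a.size() == m * k && b.size() == k);
  std::vector<float> c(m, 0.0f);
  const float *a_cur = a.data();
  for (std::size_t row = 0; row < m; ++row) {
    float acc = 0.0f;
    // One dot product per output element; a_cur keeps advancing across rows.
    for (std::size_t col = 0; col < k; ++col, ++a_cur) {
      acc += *a_cur * b[col];
    }
    c[row] = acc;
  }
  return c;
}

// Number of iterations of the vectorized inner loop for vector width r:
// b_cur starts at b and advances by r until it reaches b + k.
inline std::size_t inner_trip_count(std::size_t k, std::size_t r) {
  return (k + r - 1) / r;
}
```

Under these assumptions, `min_iteration_count(2)` is a promise that `inner_trip_count(k, r) >= 2`, that is, that `k` is larger than `r`.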
For this, Peano emits the following code in the innermost (zero-overhead) loop:
000000f0 <.LBB2_3>:
f0: e1 00 00 00 00 00 00 00 00 5b 01 68 3a 76 93 03 vlda x2, [p0], #0x40; vldb x4, [p3], #0x40; nops ; nopxm ; nopv
100: e1 00 00 00 00 00 00 00 00 5b 01 e8 39 70 ab 63 vlda x5, [p3], #0x40; vldb x3, [p0], #0x40; nops ; nopxm ; nopv
110: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
120: 8b 24 09 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; vmac.f dm1, dm1, y1, y2, r0
00000130 <.L_LEnd0>:
130: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopxm ; nopv
Note that there are a total of five VLIW instructions here.
Now, if I increase the minimum number of loop iterations in the pragma to something larger, say four, the emitted code now looks like this:
000000f0 <.LBB2_3>:
f0: e1 00 00 00 00 00 00 00 00 5b 01 20 00 70 b3 03 vlda x6, [p0], #0x40; nopb ; nops ; nopxm ; nopv
100: e1 00 00 00 00 00 00 00 00 5b 01 68 3c f6 2c 00 nopa ; vldb x8, [p3], #0x40; nops ; nopxm ; nopv
110: e1 00 00 78 49 96 00 00 00 5b 01 e8 3b 70 cb 63 vlda x9, [p3], #0x40; vldb x7, [p0], #0x40; nops ; nopx ; vmov x2, x6; nopv
120: 8b 24 09 78 49 18 01 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopx ; vmov x4, x8; vmac.f dm1, dm1, y1, y2, r0
130: e1 00 00 78 49 d7 00 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopx ; vmov x3, x7; nopv
00000140 <.L_LEnd0>:
140: e1 00 00 78 49 59 01 00 00 5b 01 20 00 f0 2c 00 nopa ; nopb ; nops ; nopx ; vmov x5, x9; nopv
That's six instructions.
To me, the latter seems like worse code (potentially slower), even though the stronger guarantee of at least four iterations covers only a subset of the cases the first version has to handle. As some additional info, without the pragma, nine instructions are emitted in the innermost loop.
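To make the comparison concrete, here is a rough back-of-the-envelope sketch. It assumes that each VLIW instruction in the loop body issues in one cycle, that each pass through the loop body retires one source-level iteration, and that there are no stalls; prologue and epilogue code outside the loop is ignored. The function name `inner_loop_cycles` is hypothetical, and the model is only an estimate, not a statement about how the hardware actually behaves.

```cpp
#include <cstdint>

// Rough estimate: cycles spent in the zero-overhead loop body, assuming one
// VLIW instruction issues per cycle and there are no stalls (an assumption).
constexpr std::uint64_t inner_loop_cycles(std::uint64_t instructions_per_iteration,
                                          std::uint64_t trip_count) {
  return instructions_per_iteration * trip_count;
}

// Loop body sizes reported above, for an example trip count of 8:
//   min_iteration_count(2): 5 instructions -> 5 * 8 = 40 cycles
//   min_iteration_count(4): 6 instructions -> 6 * 8 = 48 cycles
//   no pragma:              9 instructions -> 9 * 8 = 72 cycles
static_assert(inner_loop_cycles(5, 8) == 40, "five-instruction loop body");
static_assert(inner_loop_cycles(6, 8) == 48, "six-instruction loop body");
static_assert(inner_loop_cycles(9, 8) == 72, "nine-instruction loop body");
```

Under this simple model, the six-instruction body costs 20% more cycles in the loop than the five-instruction body, independent of the trip count.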
Is this expected (i.e., am I overlooking something) or is this potentially a bug worth looking into?
Note that this is not a critical issue (the design is data-movement bound anyway, so optimizing the kernel won't speed it up), but in case this is indicative of an issue somewhere, I figured I'd ask.