Worse code with higher `min_iteration_count`?

I'm curious about some behavior I'm seeing from Peano when using the `min_iteration_count` loop pragma.

I have the matrix-vector multiplication kernel below, where I use that pragma to indicate that the innermost loop runs at least twice:
<details>
<summary><b>mv.cc:</b></summary>
<pre>
template<unsigned r>
void matvec_vectorized(unsigned m, unsigned k, unsigned row_offset, const bfloat16 *__restrict a, const bfloat16 *__restrict b, bfloat16 *__restrict c) {
  ::aie::set_rounding(aie::rounding_mode::conv_even); 
  c += row_offset * m;
  bfloat16 *c_end = c + m;
  const bfloat16 *b_end = b + k;
  for (; c < c_end; c++) {
    aie::accum acc = aie::accum<accfloat, r>();
    // The following pragma enables pipelining the zero-overhead loop, but it does assume that k is at least two.
    // This assumption should hold for any useful use of this function; if k were one, this would be a simple scalar multiplication of a vector.
    _Pragma("clang loop min_iteration_count(2)")
    for (const bfloat16 *__restrict b_cur = b; b_cur < b_end; b_cur += r, a += r) {
      aie::vector<bfloat16, r> a_vec = aie::load_v<r>(a);
      aie::vector<bfloat16, r> b_vec = aie::load_v<r>(b_cur);
      acc = aie::mac(acc, a_vec, b_vec);
    }
    *c = (bfloat16)aie::reduce_add(acc.template to_vector<float>());
  }
}
</pre>
</details>
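
For readers who want to check results on a host machine, here is a minimal scalar sketch of what I understand the kernel above to compute. It is not the AIE code: `matvec_reference` is a hypothetical helper name, `float` stands in for `bfloat16`, and `std::vector` replaces the raw pointers so that sizes can be checked. The vectorized kernel accumulates in `r` lanes and reduces at the end, so its rounding may differ slightly from this version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, assuming:
//   a: m x k matrix stored row-major,
//   b: vector of length k,
//   c: output, written to elements [row_offset * m, row_offset * m + m).
// float is used in place of bfloat16 so this compiles without the AIE API.
void matvec_reference(unsigned m, unsigned k, unsigned row_offset,
                      const std::vector<float> &a,
                      const std::vector<float> &b,
                      std::vector<float> &c) {
  assert(a.size() >= static_cast<std::size_t>(m) * k);
  assert(b.size() >= k);
  assert(c.size() >= (static_cast<std::size_t>(row_offset) + 1) * m);
  for (unsigned i = 0; i < m; ++i) {
    float acc = 0.0f;
    // One dot product per output element; the AIE kernel does this
    // r elements at a time in its innermost loop.
    for (unsigned j = 0; j < k; ++j) {
      acc += a[static_cast<std::size_t>(i) * k + j] * b[j];
    }
    c[static_cast<std::size_t>(row_offset) * m + i] = acc;
  }
}
```

With this layout, the innermost AIE loop runs `k / r` times per output element, which is the count the pragma constrains.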

For this, Peano emits the following code in the innermost (zero-overhead) loop:

```
000000f0 <.LBB2_3>:
      f0: e1 00 00 00 00 00 00 00 00 5b 01 68 3a 76 93 03       vlda     x2, [p0], #0x40;               vldb     x4, [p3], #0x40;               nops    ;               nopxm   ;               nopv
     100: e1 00 00 00 00 00 00 00 00 5b 01 e8 39 70 ab 63       vlda     x5, [p3], #0x40;               vldb     x3, [p0], #0x40;               nops    ;               nopxm   ;               nopv
     110: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
     120: 8b 24 09 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               vmac.f  dm1, dm1, y1, y2, r0

00000130 <.L_LEnd0>:
     130: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
```

Note that there are a total of **five** VLIW instructions here.

If I increase the minimum iteration count in the pragma to something larger, say four, the emitted code looks like this:

```
000000f0 <.LBB2_3>:
      f0: e1 00 00 00 00 00 00 00 00 5b 01 20 00 70 b3 03       vlda     x6, [p0], #0x40;               nopb    ;               nops    ;               nopxm   ;               nopv
     100: e1 00 00 00 00 00 00 00 00 5b 01 68 3c f6 2c 00       nopa    ;               vldb     x8, [p3], #0x40;               nops    ;               nopxm   ;               nopv
     110: e1 00 00 78 49 96 00 00 00 5b 01 e8 3b 70 cb 63       vlda     x9, [p3], #0x40;               vldb     x7, [p0], #0x40;               nops    ;               nopx    ;               vmov    x2, x6;          nopv
     120: 8b 24 09 78 49 18 01 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x4, x8;         vmac.f  dm1, dm1, y1, y2, r0
     130: e1 00 00 78 49 d7 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x3, x7;         nopv

00000140 <.L_LEnd0>:
     140: e1 00 00 78 49 59 01 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x5, x9;         nopv
```

That's **six** instructions.
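
To avoid counting by hand when trying other pragma values, the comparison can be scripted. The following is a minimal sketch that counts instruction lines in a disassembly snippet, assuming the text format of the listings above (label lines such as `000000f0 <.LBB2_3>:` and instruction lines that start with a hex address followed by a colon). The names `is_bundle_line` and `count_bundles` are hypothetical helpers, not part of any toolchain.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Returns true if the first token of `line` is a hex address immediately
// followed by ':' (for example "f0:"), which is how instruction lines
// appear in the listings above. Label lines and blank lines return false.
bool is_bundle_line(const std::string &line) {
  std::istringstream in(line);
  std::string tok;
  if (!(in >> tok) || tok.size() < 2 || tok.back() != ':') return false;
  for (std::size_t i = 0; i + 1 < tok.size(); ++i) {
    if (!std::isxdigit(static_cast<unsigned char>(tok[i]))) return false;
  }
  return true;
}

// Counts the instruction lines (one VLIW bundle each) in a snippet.
unsigned count_bundles(const std::string &disassembly) {
  std::istringstream in(disassembly);
  std::string line;
  unsigned n = 0;
  while (std::getline(in, line)) {
    if (is_bundle_line(line)) ++n;
  }
  return n;
}
```

The caller is expected to pass only the text of the loop body, from the loop label through the bundle that follows the `.L_LEnd0` label.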

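To me, the latter looks like worse code (potentially slower): the loop body is one instruction longer and contains four additional `vmov` register copies, even though a minimum of four iterations is a stricter guarantee and so covers only a subset of the cases the first version has to handle. For additional context, without the pragma, nine instructions are emitted in the innermost loop.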

Is this expected (i.e., am I overlooking something) or is this potentially a bug worth looking into?

Note that this is not a critical issue (the design is data-movement-bound anyway, so optimizing the kernel won't speed it up), but in case this is indicative of a problem somewhere, I figured I'd ask.

Worse code with higher `min_iteration_count`? #639
