Worse code with higher min_iteration_count? #639

@andrej

I'm curious about some behavior I'm seeing from Peano using the min_iteration_count pragma.

I have a matrix-vector multiplication kernel as below, where I use that pragma to indicate that the innermost loop runs at least twice:

mv.cc:
template
void matvec_vectorized(unsigned m, unsigned k, unsigned row_offset, const bfloat16 *__restrict a, const bfloat16 *__restrict b, bfloat16 *__restrict c) {
  ::aie::set_rounding(aie::rounding_mode::conv_even); 
  c += row_offset * m;
  bfloat16 *c_end = c + m;
  const bfloat16 *b_end = b + k;
  for (; c < c_end; c++) {
    aie::accum acc = aie::accum();
    // The following two pragmas enable pipelining the zero-overhead loop, but they do assume that k is at least two.
    // This assumption should hold for any useful use of this function; if k were one, this would be a simple scalar multiplication of a vector.
    _Pragma("clang loop min_iteration_count(2)")
    for (const bfloat16 *__restrict b_cur = b; b_cur < b_end; b_cur += r, a += r) {
      aie::vector a_vec = aie::load_v(a);
      aie::vector b_vec = aie::load_v(b_cur);
      acc = aie::mac(acc, a_vec, b_vec);
    }
    *c = (bfloat16)aie::reduce_add(acc.template to_vector());
  }
}
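
For readers who want to check what the kernel computes, here is a minimal scalar sketch of the same loop structure. It is not the original code: `matvec_reference` is a hypothetical helper, `float` stands in for both `bfloat16` and the accumulator type, `row_offset` is left out, and it assumes `a` is a row-major m x k matrix and that `k` is a multiple of `r`. The assert spells out the precondition the pragma relies on, namely at least two trips of the inner loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for the kernel's loop structure (hypothetical helper, not from the issue).
// a: row-major m x k matrix, b: k-element vector, c: m-element output.
// r stands in for the vector width; the inner loop consumes r elements per trip.
template <unsigned r>
void matvec_reference(unsigned m, unsigned k, const float *a, const float *b, float *c) {
  // Precondition behind min_iteration_count(2): the inner loop makes at least two trips.
  // Assumption: k is a multiple of r, so the trip count is exactly k / r.
  assert(k % r == 0 && k / r >= 2);
  for (unsigned row = 0; row < m; ++row) {
    float acc = 0.0f;
    for (unsigned col = 0; col < k; col += r) {        // one trip per r elements
      for (unsigned lane = 0; lane < r; ++lane) {      // what one vector MAC covers
        acc += a[static_cast<std::size_t>(row) * k + col + lane] * b[col + lane];
      }
    }
    c[row] = acc;                                      // the reduce_add and store step
  }
}
```

With `r = 2`, `m = 2`, `k = 4`, the inner loop makes exactly two trips per row, which is the smallest case the pragma permits.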

For this, Peano emits the following code in the innermost (zero-overhead) loop:

000000f0 <.LBB2_3>:
      f0: e1 00 00 00 00 00 00 00 00 5b 01 68 3a 76 93 03       vlda     x2, [p0], #0x40;               vldb     x4, [p3], #0x40;               nops    ;               nopxm   ;               nopv
     100: e1 00 00 00 00 00 00 00 00 5b 01 e8 39 70 ab 63       vlda     x5, [p3], #0x40;               vldb     x3, [p0], #0x40;               nops    ;               nopxm   ;               nopv
     110: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv
     120: 8b 24 09 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               vmac.f  dm1, dm1, y1, y2, r0

00000130 <.L_LEnd0>:
     130: e1 00 00 00 00 00 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopxm   ;               nopv

Note that there are a total of five VLIW instructions here.

Now, if I increase the minimum number of loop iterations in the pragma to something larger, say four, the emitted code now looks like this:

000000f0 <.LBB2_3>:
      f0: e1 00 00 00 00 00 00 00 00 5b 01 20 00 70 b3 03       vlda     x6, [p0], #0x40;               nopb    ;               nops    ;               nopxm   ;               nopv
     100: e1 00 00 00 00 00 00 00 00 5b 01 68 3c f6 2c 00       nopa    ;               vldb     x8, [p3], #0x40;               nops    ;               nopxm   ;               nopv
     110: e1 00 00 78 49 96 00 00 00 5b 01 e8 3b 70 cb 63       vlda     x9, [p3], #0x40;               vldb     x7, [p0], #0x40;               nops    ;               nopx    ;               vmov    x2, x6;          nopv
     120: 8b 24 09 78 49 18 01 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x4, x8;         vmac.f  dm1, dm1, y1, y2, r0
     130: e1 00 00 78 49 d7 00 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x3, x7;         nopv

00000140 <.L_LEnd0>:
     140: e1 00 00 78 49 59 01 00 00 5b 01 20 00 f0 2c 00       nopa    ;               nopb    ;               nops    ;               nopx    ;               vmov    x5, x9;         nopv

That's six instructions.

To me, the latter looks like worse (potentially slower) code, even though a minimum of four iterations is a stricter guarantee than a minimum of two: every loop the second version has to handle is also valid input for the first. For comparison, without the pragma, nine instructions are emitted in the innermost loop.
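
To put the three instruction counts reported above side by side, here is a small sketch that computes their ratios. The names are hypothetical, and reading a ratio as a slowdown rests on assumptions I have not verified: one VLIW bundle issues per cycle, there are no stalls, and each pass over the loop body does the same amount of work.

```cpp
// Bundle counts per loop body, taken from the listings in this issue.
constexpr int kBundlesMin2 = 5;      // min_iteration_count(2)
constexpr int kBundlesMin4 = 6;      // min_iteration_count(4)
constexpr int kBundlesNoPragma = 9;  // no pragma

// Ratio of bundle counts. This equals a cycle ratio only under the assumptions
// stated above (one bundle per cycle, no stalls, equal work per pass).
constexpr double relative_cost(int bundles, int baseline) {
  return static_cast<double>(bundles) / baseline;
}

// Relative to the min_iteration_count(2) body:
constexpr double kMin4VsMin2 = relative_cost(kBundlesMin4, kBundlesMin2);
constexpr double kNoPragmaVsMin2 = relative_cost(kBundlesNoPragma, kBundlesMin2);
```

Under those assumptions, the six-bundle body is about 20% longer per pass than the five-bundle body, and the nine-bundle body about 80% longer.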

Is this expected (i.e., am I overlooking something) or is this potentially a bug worth looking into?

Note that this is not a critical issue (the design is data-movement-bound anyway, so optimizing the kernel won't speed it up), but in case this points to a problem somewhere, I figured I'd ask.
