Make reduction more flexible #40

Open
HanKuanChen opened this issue Sep 7, 2020 · 5 comments
Labels
Revisit after v1.0 Features or problems we will revisit after the v1.0 release

Comments

@HanKuanChen
Contributor

Following #25, the current reduction intrinsics use this form:

vint8m1_t vredsum_vs_i8mf8_i8m1 (vint8m1_t dst, vint8mf8_t vector, vint8m1_t scalar);
vint8m1_t vredsum_vs_i8mf4_i8m1 (vint8m1_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint64m1_t vredsum_vs_i64m8_i64m1 (vint64m1_t dst, vint64m8_t vector, vint64m1_t scalar);

I have two questions about this interface.

  1. The purpose of dst is to preserve the tail elements. Why does this support only the m1 type?
  2. Only the first element of scalar is used. Why does this support only the m1 type?

In my opinion, it should support combinations like the following:

vint8mf8_t vredsum_vs_i8mf8_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf8_i8m1_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8m1_t scalar);
...
vint8mf8_t vredsum_vs_i8mf4_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8mf2_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf8_t vredsum_vs_i8mf4_i8m1_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8m1_t scalar);
...
vint8mf4_t vredsum_vs_i8mf4_i8mf8_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf8_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf4_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf4_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8mf2_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8mf2_t scalar);
vint8mf4_t vredsum_vs_i8mf4_i8m1_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8m1_t scalar);
...

To fully support every combination, we need 343 (7 × 7 × 7) intrinsics for i8-type redsum: one factor of 7 for each of dst, vector, and scalar ranging over the LMUL values mf8, mf4, mf2, m1, m2, m4, and m8.

However, this many intrinsics would annoy users. To address this, I see the following solutions.

  1. Just do it.
  2. Add a type which describes only the first vector element, and a helper intrinsic to produce it (a usage sketch follows below). The interface would be
    vint8_t get_first(vint8m1_t); // No op. This is not vmv_x_s.
    
    vint8mf8_t vredsum_vs_i8mf8_i8mf8 (vint8mf8_t dst, vint8mf8_t vector, vint8_t scalar);
    vint8mf4_t vredsum_vs_i8mf8_i8mf4 (vint8mf4_t dst, vint8mf8_t vector, vint8_t scalar);
    vint8mf2_t vredsum_vs_i8mf8_i8mf2 (vint8mf2_t dst, vint8mf8_t vector, vint8_t scalar);
    vint8m1_t vredsum_vs_i8mf8_i8m1 (vint8m1_t dst, vint8mf8_t vector, vint8_t scalar);
    ...
    vint8mf8_t vredsum_vs_i8mf4_i8mf8 (vint8mf8_t dst, vint8mf4_t vector, vint8_t scalar);
    vint8mf4_t vredsum_vs_i8mf4_i8mf4 (vint8mf4_t dst, vint8mf4_t vector, vint8_t scalar);
    vint8mf2_t vredsum_vs_i8mf4_i8mf2 (vint8mf2_t dst, vint8mf4_t vector, vint8_t scalar);
    vint8m1_t vredsum_vs_i8mf4_i8m1 (vint8m1_t dst, vint8mf4_t vector, vint8_t scalar);
    ...
    
    This needs 49 (7 × 7) intrinsics for i8-type redsum.
  3. Keep the current status.
    If users want to use another LMUL to preserve the tail elements, they should do the vmerge themselves.
    Add intrinsics that convert between different LMULs (so that the scalar parameter can be more flexible), e.g., mf8_to_m1 and m8_to_m1.
    The number of intrinsics for i8 redsum stays the same.
    In addition, this might partially address both "How to combine/split vectors use rvv intrinsics?" #28 and "Reinterpret between different LMUL under the same SEW" #37.

I prefer solution 2. Any ideas?
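
For concreteness, a hedged sketch of solution 2 in use; vint8_t and get_first are the proposed additions from option 2 (not existing types or intrinsics), and vredsum_vs_i8mf4_i8mf4 follows the two-suffix naming in that option:

vint8mf4_t reduce_mf4(vint8mf4_t dst, vint8mf4_t vector, vint8m1_t init) {
    // No-op view of element 0 of init; this is not vmv_x_s.
    vint8_t scalar = get_first(init);
    // One intrinsic per (vector LMUL, dst LMUL) pair; the scalar operand's
    // LMUL no longer appears in the name.
    return vredsum_vs_i8mf4_i8mf4(dst, vector, scalar);
}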

@zakk0610
Collaborator

zakk0610 commented Sep 7, 2020

The purpose of dst is to preserve the tail elements. Why does this support only the m1 type?
Only the first element of scalar is used. Why does this support only the m1 type?

I guess it's a historical reason: RVV 0.8 did not support fractional LMUL.

I prefer keeping the original design. In your case, we can use int8_t vmv_x_s_i8m1_i8 (vint8m1_t src); to extract the first element (the reduction result) from an m1 vector, and vint8m8_t vmv_s_x_i8m8 (vint8m8_t dst, int8_t src); to insert the reduction result into a vector of whatever type you want while keeping its tail elements.

Don't worry about the performance; ideally the compiler can do this optimization for you.
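
A minimal sketch of this pattern, using the prototypes quoted in this thread (vl arguments omitted to match those prototypes; current toolchains prefix these names with __riscv_):

vint8m8_t reduce_into_m8(vint8m8_t dst, vint8mf8_t vector, vint8m1_t init) {
    // Reduce with the standard m1-destination form; init serves as both
    // the tail source (dst) and the scalar start value of the sum.
    vint8m1_t red = vredsum_vs_i8mf8_i8m1(init, vector, init);
    // Extract the reduction result (element 0) into a scalar register.
    int8_t x = vmv_x_s_i8m1_i8(red);
    // Write it into element 0 of the m8 destination; the tail elements
    // of dst are preserved.
    return vmv_s_x_i8m8(dst, x);
}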

@nick-knight
Collaborator

I prefer to keep the m1 type for the dst and scalar operands. In my opinion, this most closely matches the spec:

The scalar input and output operands are held in element 0 of a single vector register, not a vector register group, so any vector register can be the scalar source or destination of a vector reduction regardless of LMUL setting.

As an aside, another way to access the result of the reduction without loss of decoupling between scalar and vector instruction pipelines (when applicable) is just to use vrgather.vi dst, src, 0.

@nick-knight
Collaborator

Just checking in on this open issue. I'm still in favor of the current approach, using m1 for dst and scalar operands.

I agree with @zakk0610 that this complicates the use-case of writing the reduction to the first element in an LMUL > 1 group. In principle, a compiler could fuse the scalar <-> vector copies by intelligently allocating the reduction's destination register. However, I don't think this is a terribly common use-case, so I suspect that implementers will not prioritize this optimization. If users complain, we could always add additional intrinsics following @HanKuanChen's suggestions.

Alternatively, we could implement register group fission/fusion, e.g.,

vint8m1_t fission_int8m2_m1_0 (vint8m2_t); // return zeroth vector register in group
vint8m1_t fission_int8m2_m1_1 (vint8m2_t); // return first vector register in group

which make the copies more obvious to both the user and the compiler. I think I proposed something like this many months ago, but it did not win favor because of the "SLEN issue", which is no longer an issue.
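
As a hedged sketch, the proposed fission intrinsic could feed any register group's zeroth register into a reduction; fission_int8m2_m1_0 is the hypothetical name from above, and vredsum_vs_i8mf8_i8m1 follows the prototype at the top of this issue:

vint8m1_t reduce_with_m2_operand(vint8mf8_t vector, vint8m2_t group) {
    // Zeroth vector register of the m2 group, viewed as an m1 value
    // (ideally a no-op after register allocation).
    vint8m1_t scalar = fission_int8m2_m1_0(group);
    return vredsum_vs_i8mf8_i8m1(scalar, vector, scalar);
}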

@eopXD
Collaborator

eopXD commented Jul 29, 2022

This is a good feature to consider; let us settle on a version first and revisit this in the future.

@eopXD added the Revisit after v1.0 label on Jul 29, 2022
@sh1boot

sh1boot commented Dec 23, 2023

How about extending vget_u8m1 to act as vlmul_ext_u8m1 when receiving mf types, and the complement for vset_u8m1? (Don't forget to include the identity transforms.)

Then the caller can convert anything to m1 and back around the reduction, in a type-agnostic way.
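
A sketch of that round trip with the existing vlmul_ext/vlmul_trunc intrinsics (shown with i8 to match the rest of the thread; current toolchains prefix these names with __riscv_, and vl arguments are omitted as in the prototypes above):

vint8mf8_t reduce_mf8(vint8mf8_t vector, vint8mf8_t init) {
    // Widen the mf8 initial value to m1; only element 0 matters here.
    vint8m1_t init_m1 = vlmul_ext_v_i8mf8_i8m1(init);
    vint8m1_t red = vredsum_vs_i8mf8_i8m1(init_m1, vector, init_m1);
    // Truncate the m1 result back to mf8; element 0 carries the sum.
    return vlmul_trunc_v_i8m1_i8mf8(red);
}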
