Make reduction more flexible #40
Comments
I guess it's for historical reasons, because RVV 0.8 did not support fractional LMUL. I prefer keeping the original design; in your case, we can use Don't worry about performance; ideally the compiler can do the optimization for you.
I prefer to keep
As an aside, another way to access the result of the reduction without loss of decoupling between scalar and vector instruction pipelines (when applicable) is just to use
Just checking in on this open issue. I'm still in favor of the current approach, using I agree with @zakk0610 that this complicates the use case of writing the reduction to the first element in an LMUL > 1 group. In principle, a compiler could fuse the scalar <-> vector copies by intelligently allocating the reduction's destination register. However, I don't think this is a terribly common use case, so I suspect that implementers will not prioritize this optimization. If users complain, we could always add additional intrinsics following @HanKuanChen's suggestions. Alternatively, we could implement register group fission/fusion, e.g.,
which makes the copies more obvious to both the user and the compiler. I think I proposed something like this many months ago, but was not in favor of it because of the "SLEN issue", which is no longer an issue.
This is a good feature to consider. Let's settle on a version first and revisit this in the future.
How about extending Then the caller can convert anything to m1 and back around the reduction, in a type-agnostic way.
Following #25, the current reduction instructions use this form
I have 2 questions for this interface.
`m1` type?
`m1` type?

In my opinion, it should support something like the following
To fully support the whole combination, we need 343 (7 x 7 x 7) intrinsics for `i8` type redsum.

However, users will be annoyed by this interface design. To solve this, I have the following solutions.
`i8` type redsum. If users want to use another LMUL to preserve the tail elements, they should do the `vmerge` themselves.

Add intrinsics to support exchange between different LMULs (so that the scalar parameter can be more flexible), e.g., `mf8_to_m1` and `m8_to_m1`. Same number of intrinsics for `i8` type redsum. In addition, this might partially solve "How to combine/split vectors use rvv intrinsics?" (#28) and "Reinterpret between different LMUL under the same SEW" (#37) together.
I prefer solution 2. Any thoughts?
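
To make the counting argument and the tail-preservation concern concrete, here is a minimal plain-C sketch. It does not call any real RVV intrinsic: `LMUL_NAMES`, `full_combination_count`, and `model_redsum_i8` are hypothetical names invented for illustration. It assumes seven LMUL settings (mf8 through m8) for each of the three operands, and the usual reduction semantics where element 0 of the result receives `scalar[0]` plus the sum of the first `vl` source elements while the remaining elements are taken from `dst`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the seven LMUL settings used in the 7 x 7 x 7 count. */
static const char *const LMUL_NAMES[] = {"mf8", "mf4", "mf2", "m1",
                                         "m2",  "m4",  "m8"};
#define NUM_LMUL (sizeof(LMUL_NAMES) / sizeof(LMUL_NAMES[0]))

/* Number of i8 redsum intrinsics if dst, vector, and scalar may each
 * independently use any LMUL. */
static size_t full_combination_count(void) {
    return NUM_LMUL * NUM_LMUL * NUM_LMUL;
}

/* Hypothetical scalar model of an i8 sum reduction (not a real intrinsic).
 *   out[0]           = scalar[0] + vector[0] + ... + vector[vl - 1]
 *   out[1..dst_len)  = dst[1..dst_len)   (tail elements preserved)
 * The sum is accumulated modulo 2^8; converting the result back to int8_t
 * assumes the usual two's-complement wraparound. */
static void model_redsum_i8(int8_t *out, const int8_t *dst, size_t dst_len,
                            const int8_t *vector, size_t vl,
                            const int8_t *scalar) {
    assert(dst_len >= 1); /* the result always needs an element 0 */

    uint8_t acc = (uint8_t)scalar[0];
    for (size_t i = 0; i < vl; ++i) {
        acc = (uint8_t)(acc + (uint8_t)vector[i]);
    }

    /* Copy the tail first so that out may alias dst. */
    for (size_t i = 1; i < dst_len; ++i) {
        out[i] = dst[i];
    }
    out[0] = (int8_t)acc;
}
```

In this model the lengths of `dst`, `vector`, and `scalar` are independent, which is what the fully flexible interface would expose. With only an `m1` destination, a caller who wants the tail of a wider group preserved has to merge the result back in as a separate step.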