Implement RISC-V Vector 1.0 kernels #774
set(QEMU_VLEN "128")
endif()

set(CMAKE_CROSSCOMPILING_EMULATOR "qemu-riscv64-static -L /usr/riscv64-linux-gnu/ -cpu rv64,v=on,vlen=${QEMU_VLEN},rvv_ta_all_1s=on,rvv_ma_all_1s=on")
It's generally good to enable zba and zbb as well, since they are available on all boards that have v, and they bring a nice performance boost to scalar code. You can enable them with zba=true,zbb=true
Ah, good idea, I'll add that.
I didn't think to include it, because google/cpu_features currently doesn't support detecting the bitmanip extensions.
I tried to patch this upstream, but couldn't figure out how to sign the CLA without a Google account: google/cpu_features#369
Their extension detection is fundamentally broken anyways, see: google/cpu_features#368 (It would currently parse rv64gc_xmycustomextensionwithavsomewhere as rv64gcv)
Maybe you could forward this to the Google people at RISE.
google/cpu_features should probably use hwprobe instead of erroneously parsing cpuinfo, and it would be amazing if support for profiles and extension groups were added.
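To make the parsing problem concrete, here is a minimal sketch of a check that only scans the single-letter part of an ISA string, so a `v` buried inside a multi-letter extension name is never matched. The function name `isa_has_single_letter_ext` is hypothetical and is not part of google/cpu_features. The sketch also ignores version suffixes such as `2p1` and does not expand `g`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from google/cpu_features): report whether the
 * single-letter extension `ext` appears in an ISA string such as
 * "rv64imafdcv_zba_zbb". Scanning stops at the first '_' or at the first
 * letter that starts a multi-letter extension name (z, s, x), so letters
 * inside names like "xmycustom..." are never considered. */
static bool isa_has_single_letter_ext(const char* isa, char ext)
{
    if (isa == NULL || strncmp(isa, "rv", 2) != 0) {
        return false;
    }
    const char* p = isa + 2;
    while (*p >= '0' && *p <= '9') {
        p++; /* skip the XLEN digits, e.g. "32" or "64" */
    }
    for (; *p != '\0' && *p != '_'; p++) {
        if (*p == 'z' || *p == 's' || *p == 'x') {
            break; /* start of a multi-letter extension name */
        }
        if (*p == ext) {
            return true;
        }
    }
    return false;
}
```

With this check, the string from the issue above is not reported as having `v`, while a plain `rv64gcv` is. On Linux, querying the kernel via hwprobe avoids string parsing entirely, which is why it is the preferable route.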
Wow, I gotta get my BPI-F3 set up. I'll be testing with GNU Radio.
Thanks to cloud-v.co I got access to a machine with C910 cores, which don't support RVV 1.0, but rather the XTheadVector custom extension based on the RVV 0.7.1 draft specification, with VLEN=128. The XTheadVector target for gcc isn't perfect, so I couldn't get every kernel to compile properly, but here are the results from the ones that worked:
Keep in mind that the codegen for XTheadVector is often worse than for RVV 1.0, because additional instructions need to be inserted. Edit: The latest force push fixed register spills in two kernels.
Signed-off-by: Olaf Bernstein <[email protected]>
I ran clang-format on all changed files, so the formatting issues should hopefully be fixed now.
Thanks! Also, I was doing a review and got an "outdated file" (or something similar) error at some point. I was already worried that comments got lost. Luckily they didn't. Thanks for taking care of the formatting stuff.
That's a very impressive PR. Thank you!
I have a few minor comments. Besides, the PR looks good to me.
You probably need to rebase your PR on the latest |
I got my BPI-F3 running today. I'm using Bianbu 2.0.1, but I've ripped out the desktop and am just using it headless. Here are my volk_profile results. Should have GNU Radio running tomorrow.
Tested with GNU Radio. Works perfectly for at least these kernels:
Thanks for this huge contribution! I'll do a release soon. I hope you don't mind that I mention your name (as it appears in the sign-off) in the release changelog.
Yeah, that'd be great.
This PR adds full RVV vectorization to all kernels, excluding deprecated ones and volk_32f_s32f_power_32f and volk_32fc_s32f_power_32fc, which don't have any vectorized kernels for other architectures, presumably due to precision requirements.
All the tests pass in qemu with VLEN 128 up to 1024, and were additionally tested on all input sizes from 0 to 1000. They have been integrated into CI, with both clang and gcc.
I've attached the output of volk_profile running on the SpacemiT X60 cores of the Banana Pi BPI-F3 SoC: X60-gcc-14.txt, X60-clang-18.txt
The average speedup across all kernels compared to the previously fastest one was 3.8x.
The code is written using RVV 1.0 intrinsics, following the frozen v0.12 spec, which is supported by gcc 14 and clang 18 and above. The build system was adjusted to detect support for RVV intrinsics, to make sure we don't break builds on older compilers.
I tried to maximize LMUL without causing register spills, while avoiding lane crossing permutations.
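As an illustration of the two points above (compile-time detection of the intrinsics and a large LMUL without spills), here is a minimal sketch of a strip-mined element-wise add. It is not taken from the PR: the names `sketch_add_32f` and `SKETCH_HAVE_RVV` are hypothetical, and the `__riscv_v_intrinsic >= 12000` check assumes the compiler reports the v0.12 intrinsics through that macro. On compilers without RVV intrinsics it falls back to a scalar loop, so it builds anywhere.

```c
#include <stddef.h>

#if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic >= 12000
#include <riscv_vector.h>
#define SKETCH_HAVE_RVV 1
#else
#define SKETCH_HAVE_RVV 0
#endif

/* Hypothetical kernel, not from the PR: c[i] = a[i] + b[i] for i < n. */
static void sketch_add_32f(float* c, const float* a, const float* b, size_t n)
{
#if SKETCH_HAVE_RVV
    /* Strip-mined loop: vsetvl returns how many elements this pass handles,
     * so no scalar tail loop is needed and n == 0 falls straight through.
     * LMUL=8 leaves four register groups; this loop needs at most three. */
    for (size_t vl; n > 0; n -= vl, a += vl, b += vl, c += vl) {
        vl = __riscv_vsetvl_e32m8(n);
        vfloat32m8_t va = __riscv_vle32_v_f32m8(a, vl);
        vfloat32m8_t vb = __riscv_vle32_v_f32m8(b, vl);
        __riscv_vse32_v_f32m8(c, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
    }
#else
    /* Scalar fallback for compilers or targets without RVV intrinsics. */
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
#endif
}
```

A kernel with more live values would need a smaller LMUL (m4 or m2) to stay within the 32 vector registers, which is the trade-off described above.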
Segmented loads/stores vary a lot in performance on current systems; compare the vsseg graph between the C910 and X60. To get the best performance I didn't use them in the rvv target, but created an additional pseudo target rvvseg, with alternative implementations using segmented loads/stores.
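To show what the two flavours can look like, here is a hedged sketch that splits interleaved (re, im) float pairs into two arrays, once with a segmented load and once with a plain 64-bit load followed by narrowing shifts. It is an illustration under stated assumptions, not code from the PR: `sketch_deinterleave` and `SKETCH_HAVE_RVV` are hypothetical names, the input is assumed to be 8-byte aligned and little-endian, and a scalar fallback is used when RVV intrinsics are unavailable.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv_v_intrinsic) && __riscv_v_intrinsic >= 12000
#include <riscv_vector.h>
#define SKETCH_HAVE_RVV 1
#else
#define SKETCH_HAVE_RVV 0
#endif

/* Hypothetical kernel: re[i] = in[2*i], im[i] = in[2*i + 1] for i < n.
 * use_seg != 0 picks the segmented-load variant when RVV is available.
 * Assumption: `in` is 8-byte aligned (needed by the 64-bit load path). */
static void sketch_deinterleave(float* re, float* im, const float* in,
                                size_t n, int use_seg)
{
#if SKETCH_HAVE_RVV
    for (size_t vl; n > 0; n -= vl, in += 2 * vl, re += vl, im += vl) {
        vl = __riscv_vsetvl_e32m4(n);
        vfloat32m4_t vr, vi;
        if (use_seg) {
            /* Segmented load: hardware splits the pairs into two groups. */
            vfloat32m4x2_t t = __riscv_vlseg2e32_v_f32m4x2(in, vl);
            vr = __riscv_vget_v_f32m4x2_f32m4(t, 0);
            vi = __riscv_vget_v_f32m4x2_f32m4(t, 1);
        } else {
            /* Load each pair as one 64-bit element, then narrow:
             * shift by 0 keeps the low half (re), by 32 the high half (im). */
            vuint64m8_t v = __riscv_vle64_v_u64m8((const uint64_t*)in, vl);
            vr = __riscv_vreinterpret_v_u32m4_f32m4(__riscv_vnsrl_wx_u32m4(v, 0, vl));
            vi = __riscv_vreinterpret_v_u32m4_f32m4(__riscv_vnsrl_wx_u32m4(v, 32, vl));
        }
        __riscv_vse32_v_f32m4(re, vr, vl);
        __riscv_vse32_v_f32m4(im, vi, vl);
    }
#else
    (void)use_seg; /* scalar fallback has only one variant */
    for (size_t i = 0; i < n; i++) {
        re[i] = in[2 * i];
        im[i] = in[2 * i + 1];
    }
#endif
}
```

Both paths produce the same output; which one is faster depends on how the core implements segmented accesses, which is the reason for keeping them in separate targets.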
I didn't modify the existing kernels, except for the following cases:
volk_32u_popcntpuppet_32u.h and volk_64u_popcntpuppet_64u.h had a bug where they didn't compute anything.
volk_8u_conv_k7_r2puppet_8u.h created a lookup table with 256 entries once, but only used it for a total of 64 lookups, so I replaced it with the direct calculation.
This should resolve #772
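For readers wondering what replacing a 256-entry table with a direct calculation can look like: the text above doesn't say what the table holds, so the sketch below assumes, purely for illustration, a bit-parity table of the kind often used in convolutional encoders. The name `sketch_parity8` is hypothetical and the code is not taken from the PR.

```c
#include <stdint.h>

/* Hypothetical helper: parity (XOR of all eight bits) of one byte,
 * computed by XOR folding instead of indexing a 256-entry table. */
static inline unsigned sketch_parity8(uint8_t x)
{
    x ^= (uint8_t)(x >> 4); /* fold the high nibble into the low nibble */
    x ^= (uint8_t)(x >> 2); /* fold 4 bits into 2 */
    x ^= (uint8_t)(x >> 1); /* fold 2 bits into 1 */
    return x & 1u;
}
```

When only a few dozen lookups happen in total, three shifts and XORs per value cost less than filling 256 table entries up front.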