Public Review: Need for whole register (unpredicated) load/stores to facilitate compilers load/store elimination #378
I'd like to reiterate: there is currently no usable, standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes this unusable for single-header libraries and requires extensive redesigning of existing library architectures. The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT
I tried to emulate it by casting the pointer, but the code is still not great: https://godbolt.org/z/cx43oETh4
Some LLVM IR surgery gets good codegen, even while being VLEN-agnostic: https://godbolt.org/z/jh8YK394T → https://godbolt.org/z/ePrGdzEKx. I'd imagine it wouldn't be hard for a compiler to recognize and perform that conversion. Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.
I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of
The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*xlen bytes. Changing the type to
The core problem you're trying to resolve is using vector intrinsics to implement fixed-length vector functionality. Honestly, this isn't the focus of the RVV intrinsics spec (at least in version 1.0). A straightforward approach is to use GNU vectors (e.g., …).

This reflects a limitation in the current compiler implementation, as it doesn't handle memory analysis for scalable vector types very effectively.

Returning to the main problem we want to solve: creating an easier programming model for SIMD-style programs while also improving code generation. One idea I have is to provide an alternative set of intrinsics to improve both user experience and code generation quality. Here's a concrete example:

```c
int32x4_t a, b, c;

a = __riscv_vle32(int32x4_t, ptr_a, 4);
// We could also provide an overloaded version for VLMAX, e.g., __riscv_vle32(int32x4_t, ptr_a);
// Or simply use: a = *(int32x4_t *)ptr_a;

b = __riscv_vle32(int32x4_t, ptr_b, 4);
c = __riscv_vadd(int32x4_t, a, b, 4);
// Alternative syntax: c = a + b;
// or c = __riscv_vadd(int32x4_t, a, b);

__riscv_vse32(int32x4_t, ptr_c, c);
// Or: *(int32x4_t *)ptr_c = c;
```

This approach was discussed in the early stages of the RVV intrinsics, but it wasn't prioritized, so it didn't come to fruition.

Another idea is to try converting scalable vector types to fixed-length vectors, which might improve code generation quality. However, this would require significant engineering effort, so upstream toolchain compilers may not consider it unless there's strong motivation.

In conclusion, I would say that introducing intrinsics for whole-register vector load/store doesn't truly solve the issue; the real problem lies in the compiler implementation.
FYI: This ABI proposal is trying to resolve this issue: |
What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b? |
That's what I noted as an option here before:
I don't think there necessarily needs to be a restriction on the relative size. Without the restriction, you could have a 32B or 64B buffer and write generic code over VLEN=128/256/512 (VLEN≥1024 would still work, but only use the low 512 bits), allowing good VLEN=256/512 performance while still being compatible with VLEN=128. Also noted as an idea here.
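The oversized-buffer idea above can be sketched in portable C. All names here are illustrative, and a 16-byte GNU vector stands in for a VLEN=128 register; VLEN=256 hardware would cover the same buffer in fewer chunks:

```c
#include <string.h>

/* Fixed 32-byte storage that lives in ordinary data structures; each
   target processes it in its own register-sized chunks. */
typedef struct { int buf[8]; } Buf256;

/* Stand-in "register" type: what a VLEN=128 target would load at once. */
typedef int v4i __attribute__((vector_size(16)));

/* Double every element, walking the fixed buffer one register-sized
   chunk at a time. A wider target would simply take bigger strides. */
static void double_all(Buf256 *d) {
    for (int i = 0; i < 8; i += 4) {
        v4i v;
        memcpy(&v, d->buf + i, sizeof v);  /* whole-register load  */
        v = v + v;
        memcpy(d->buf + i, &v, sizeof v);  /* whole-register store */
    }
}
```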
The absence of whole register load/store instructions was already discussed in the past with the following conclusion:
I'd like to posit what I think is a "compelling use-case".
A lot of libraries use a fixed-size SIMD abstraction type that allows code sharing between existing SIMD ISAs (SSE, AVX, NEON, ...): simdjson, vectorscan, ...
This requires the ability to store the vector register state in data structures, which is currently only properly possible via the `riscv_rvv_vector_bits` attribute extension supported by both GCC and Clang, since it requires a fixed size known at compile time. This attribute isn't standardized, and can only be used for a single VLEN without potentially major rearchitecting of code structure and build system, as it depends on the `-mrvv-vector-bits` command line argument.

An alternative approach for implementing such abstract SIMD types is to have all operations load/store from a fixed-width buffer in the structure, and rely on the compiler eliminating the redundant load/stores. This avoids having to store the variable-length vector register directly in the SIMD class, and allows multiple implementations of this SIMD type for different VLEN. The generated code using these types would have to be runtime-dispatched based on the actual VLEN.
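The buffer-based pattern described above can be sketched in a few lines of C. The names are illustrative, and a GNU `vector_size` type stands in for the RVV register type; the point is that the wrapper struct holds only plain bytes (so it is freely storable), while every operation round-trips through the buffer and correctness never depends on the compiler, only codegen quality does:

```c
#include <string.h>

/* Storable fixed-size wrapper: just a buffer, no register type inside. */
typedef struct { int buf[4]; } Simd128;

/* Stand-in for the target's vector register type. */
typedef int v4i __attribute__((vector_size(16)));

/* Each operation loads from the buffer, computes, and stores back.
   Chained operations produce a redundant store/load pair between them,
   which we rely on the compiler to eliminate. */
static Simd128 simd_add(Simd128 x, Simd128 y) {
    v4i vx, vy;
    memcpy(&vx, x.buf, sizeof vx);   /* load from buffer     */
    memcpy(&vy, y.buf, sizeof vy);
    vx += vy;
    Simd128 r;
    memcpy(r.buf, &vx, sizeof vx);   /* store back to buffer */
    return r;
}
```

With unpredicated loads/stores a compiler can collapse `simd_add(simd_add(x, y), z)` into two vector adds with no intermediate memory traffic; the issue below is that with predicated RVV loads/stores it currently cannot.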
It could also be used to, for example, create a SIMD type that can be stored in data structures and works for VLEN 128, 256, and 512, by making the buffer always 512 bits wide and simply not using the extra bits for VLEN<512. This isn't possible using the `riscv_rvv_vector_bits` attribute either, because it assumes a single fixed VLEN.

This approach is, however, currently unusable, since no current compiler is capable of eliminating redundant predicated load/stores: https://godbolt.org/z/TdajMTMKT
The `actual` function in the link above represents what the codegen for such an implementation currently looks like, and `expected` simulates what the codegen should be with redundant load/store elimination. As you can see, even when always using vl=VLMAX, no redundant load/stores are removed.
Since the RVV compiler backends aren't as mature, I also compared how the compilers handle predicated (masked) vs. unpredicated AVX512 load/stores. There you can observe that redundant predicated AVX512 load/stores also can't be eliminated, but unpredicated ones can.
Hence, I suggest adding unpredicated RVV load/store intrinsics, i.e. the whole-register load/stores, to help facilitate the compilers' load/store elimination in this use case.