Public Review: Need for whole register (unpredicated) load/stores to facilitate compilers load/store elimination #378

Open
camel-cdr opened this issue Oct 21, 2024 · 9 comments


@camel-cdr
Contributor

The absence of whole-register load/store instructions has already been discussed in the past, with the following conclusion:

The conclusion is that the API intentionally omits whole-register loads, stores, and moves. The rationale is that the usual loads/stores/moves provide the same functionality, and a compiler could instead generate whole-register versions if it thought it would be profitable. If a compelling use-case arises in the future, we could introduce new intrinsics in a backwards-compatible way.

I'd like to posit what I think is a "compelling use-case".

Many libraries use a fixed-size SIMD abstraction type that allows code sharing between existing SIMD ISAs (SSE, AVX, NEON, ...): simdjson, vectorscan, ...
This requires the ability to store vector register state in data structures, which needs a fixed size known at compile time and is currently only properly possible via the riscv_rvv_vector_bits attribute extension supported by both GCC and Clang.
This attribute isn't standardized, and because it depends on the -mrvv-vector-bits command-line argument, it can only target a single VLEN without potentially major rearchitecting of the code structure and build system.
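
For context, a minimal sketch of how the attribute is used today (illustrative names; assumes Clang or GCC with e.g. -mrvv-vector-bits=256, which defines __riscv_v_fixed_vlen):

   #include <riscv_vector.h>

   #if defined(__riscv_v_fixed_vlen)
   typedef vuint32m1_t fixed_u32m1_t
       __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
   struct simd_u32 { fixed_u32m1_t v; };   // now has a fixed size and can live in data structures
   #endif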

An alternative approach for implementing such abstract SIMD types is to have all operations load/store from a fixed-width buffer in the structure and rely on the compiler eliminating the redundant loads/stores. This avoids having to store the variable-length vector register directly in the SIMD class, and allows multiple implementations of the SIMD type for different VLEN. The generated code using these types would have to be runtime-dispatched based on the actual VLEN.
It could also be used to, e.g., create a SIMD type that can be stored in data structures and works for VLEN 128, 256, and 512, by making the buffer always 512 bits wide and simply not using the extra bits when VLEN<512. This isn't possible with the riscv_rvv_vector_bits attribute either, because it assumes a single fixed VLEN.
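
A minimal sketch of this buffer-based approach (illustrative, not the code from the linked example; the type and function names are made up):

   #include <riscv_vector.h>
   #include <stdint.h>

   typedef struct { uint32_t data[16]; } u32x16;   // 512-bit backing buffer

   static inline u32x16 u32x16_add(u32x16 a, u32x16 b) {
       size_t vl = __riscv_vsetvl_e32m1(16);       // min(16, VLMAX): 16 at VLEN=512, fewer below
       vuint32m1_t va = __riscv_vle32_v_u32m1(a.data, vl);
       vuint32m1_t vb = __riscv_vle32_v_u32m1(b.data, vl);
       u32x16 r;
       __riscv_vse32_v_u32m1(r.data, __riscv_vadd_vv_u32m1(va, vb, vl), vl);
       return r;   // the hope: chained operations let the compiler drop the store/reload pairs
   }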

This approach is, however, currently unusable, since no current compiler is capable of eliminating redundant predicated loads/stores: https://godbolt.org/z/TdajMTMKT
In the link above, the actual function shows what the codegen for such an implementation currently looks like, and the expected function simulates what the codegen should be with redundant load/store elimination.
As you can see, even when always using vl=VLMAX, no redundant loads/stores are removed.
Since the RVV compiler backends aren't as mature, I also compared how the compilers handle predicated (masked) vs. unpredicated AVX512 loads/stores.
There you can observe that redundant predicated AVX512 loads/stores also can't be eliminated, but unpredicated ones can.
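
For illustration, the AVX512 distinction boils down to something like the following sketch (not the code from the linked comparison):

   #include <immintrin.h>
   #include <stdint.h>

   // Unpredicated load: compilers readily forward/eliminate redundant ones.
   static inline __m512i load_unpredicated(const int32_t *buf) {
       return _mm512_loadu_si512(buf);
   }

   // Predicated (masked) load, even with an all-ones mask: typically left as-is.
   static inline __m512i load_predicated(const int32_t *buf) {
       return _mm512_maskz_loadu_epi32((__mmask16)0xFFFF, buf);
   }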

Hence, I suggest adding unpredicated RVV load/store intrinsics, i.e. the whole-register loads/stores, to help facilitate the compilers' load/store elimination for this use case.
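
A sketch of what such intrinsics could look like; the names and signatures below are hypothetical (modelled on the vl1re32.v/vs1r.v instructions) and are not part of the current spec:

   // Hypothetical whole-register load/store intrinsics (illustrative only):
   void copy_whole_register(uint32_t *dst, const uint32_t *src) {
       vuint32m1_t v = __riscv_vl1re32_v_u32m1(src);  // whole-register load: no vl, no mask
       __riscv_vs1r_v_u32m1(dst, v);                  // whole-register store
   }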

@camel-cdr
Contributor Author

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

@topperc
Collaborator

topperc commented Oct 29, 2024

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

@dzaima

dzaima commented Oct 29, 2024

Some LLVM IR surgery to get good codegen, even being VLEN-agnostic: https://godbolt.org/z/jh8YK394T https://godbolt.org/z/ePrGdzEKx

I'd imagine it wouldn't be hard for a compiler to recognize and convert __riscv_vle/__riscv_vse with a constant vl into native loads/stores, similarly to what is shown here.

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

@camel-cdr
Contributor Author

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data.
It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz
It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

@topperc
Collaborator

topperc commented Oct 29, 2024

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*XLEN bytes. Changing the type to uint32_t *data makes the struct smaller, so it fits in a GPR.
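
A small illustration of the ABI point (a sketch; per the RISC-V calling convention, aggregates larger than 2*XLEN bytes are passed by reference):

   #include <stdint.h>

   struct big   { uint32_t data[16]; };  // 64 bytes > 2*XLEN on RV64: passed indirectly (hidden copy)
   struct small { uint32_t *data; };     //  8 bytes <= 2*XLEN: passed directly in a GPR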

@kito-cheng
Collaborator

The core problem you're trying to resolve is using vector intrinsics to implement some fixed-length vector functionality. Honestly, this isn't the focus of the RVV intrinsic spec (at least in version 1.0). A straightforward approach is to use GNU vectors (e.g., typedef int int32x4_t __attribute__ ((vector_size (16)));), which are well supported by both compilers and generate good code. However, the issue is that not all operations can be expressed with the built-in C/C++ operators, so we eventually need to convert the GNU types to RVV types, which leads to several redundant load/store operations that are hard to eliminate.
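
A minimal sketch of that conversion problem (illustrative names; assumes VLEN >= 128 so the 128-bit GNU vector fits in one m1 register):

   #include <riscv_vector.h>
   #include <stdint.h>

   typedef uint32_t u32x4 __attribute__((vector_size(16)));   // GNU vector type

   static inline u32x4 mul_via_rvv(u32x4 a, u32x4 b) {
       // To use an RVV intrinsic on GNU vectors, current code round-trips through memory:
       vuint32m1_t va = __riscv_vle32_v_u32m1((const uint32_t *)&a, 4);
       vuint32m1_t vb = __riscv_vle32_v_u32m1((const uint32_t *)&b, 4);
       u32x4 r;
       __riscv_vse32_v_u32m1((uint32_t *)&r, __riscv_vmul_vv_u32m1(va, vb, 4), 4);
       return r;   // these are the redundant loads/stores that are hard to eliminate
   }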

This reflects a limitation in the current compiler implementation, as it doesn't handle memory analysis for scalable vector types very effectively.

Returning to the main problem we want to solve: creating an easier programming model for SIMD-style programs while also improving code generation. One idea I have is to provide an alternative set of intrinsics to improve both user experience and code generation quality. Here’s a concrete example:

   int32x4_t a, b, c;
   a = __riscv_vle32(int32x4_t, ptr_a, 4);
   // We could also provide an overloaded version for VLMAX, e.g., __riscv_vle32(int32x4_t, ptr_a);
   // Or simply use: a = *(int32x4_t *)ptr_a;
   // ----
   b = __riscv_vle32(int32x4_t, ptr_b, 4);
   c = __riscv_vadd(int32x4_t, a, b, 4);
   // Alternative syntax: c = a + b;
   // or c = __riscv_vadd(int32x4_t, a, b);
   // ----
   __riscv_vse32(int32x4_t, ptr_c, c);
   // Or: *(int32x4_t *)ptr_c = c;

This approach was discussed in the early stages of the RVV intrinsics, but it wasn’t prioritized, so it didn’t come to fruition.

Another idea I have is to try converting scalable vector types to fixed-length vectors, which might improve code generation quality. However, this would require significant engineering effort, so upstream toolchain compilers may not consider it unless there's strong motivation.

In conclusion, I would say that introducing intrinsics for whole-register vector load/store doesn’t truly solve the issue...the real problem lies in the compiler implementation.

@kito-cheng
Collaborator

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*XLEN bytes. Changing the type to uint32_t *data makes the struct smaller, so it fits in a GPR.

FYI: This ABI proposal is trying to resolve this issue:

riscv-non-isa/riscv-elf-psabi-doc#418

@topperc
Collaborator

topperc commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?
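
A sketch of what such builtins might look like from the user's side; the names below are hypothetical and exist in no current compiler, and the example assumes Zvl256b so the 256-bit GNU vector is no larger than an m1 register:

   typedef uint32_t u32x8 __attribute__((vector_size(32)));   // 256-bit GNU vector type

   // Hypothetical conversion builtins (illustrative only), valid because
   // sizeof(u32x8) <= VLEN/8 * LMUL under Zvl256b with LMUL=1:
   vuint32m1_t to_rvv(u32x8 x)   { return __builtin_rvv_from_fixed_u32m1(x); }
   u32x8 from_rvv(vuint32m1_t v) { return __builtin_rvv_to_fixed_u32x8(v); }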

@dzaima

dzaima commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?

That's what I noted as an option here before:

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

I don't think there necessarily needs to be a restriction on the relative size. Without the restriction, you could have a 32B or 64B buffer and write generic code over VLEN=128/256/512 (VLEN≥1024 would still work, but only use the low 512 bits), allowing good VLEN=256/512 performance while still being compatible with VLEN=128. Also noted as an idea here.
