Public Review: Need for whole register (unpredicated) load/stores to facilitate compilers load/store elimination #378

Open
camel-cdr opened this issue Oct 21, 2024 · 9 comments


@camel-cdr
Contributor

The absence of whole-register load/store instructions has already been discussed in the past, with the following conclusion:

The conclusion is that the API intentionally omits whole-register loads, stores, and moves. The rationale is that the usual loads/stores/moves provide the same functionality, and a compiler could instead generate whole-register versions if it thought it would be profitable. If a compelling use-case arises in the future, we could introduce new intrinsics in a backwards-compatible way.

I'd like to posit what I think is a "compelling use-case".

Many libraries use a fixed-size SIMD abstraction type that allows code sharing between existing SIMD ISAs (SSE, AVX, NEON, ...): simdjson, vectorscan, ...
This requires the ability to store vector register state in data structures, which needs a fixed size known at compile time and is currently only properly possible via the riscv_rvv_vector_bits attribute extension supported by both GCC and Clang.
This attribute isn't standardized, and because it depends on the -mrvv-vector-bits command-line argument, it can only target a single VLEN without potentially major rearchitecting of the code structure and build system.
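
For context, a minimal sketch of how the attribute is used today (illustrative names; assumes Clang or GCC with e.g. -mrvv-vector-bits=256, which defines __riscv_v_fixed_vlen):

   #include <riscv_vector.h>

   #if defined(__riscv_v_fixed_vlen)
   typedef vuint32m1_t fixed_u32m1_t
       __attribute__((riscv_rvv_vector_bits(__riscv_v_fixed_vlen)));
   struct simd_u32 { fixed_u32m1_t v; };   // now has a fixed size and can live in data structures
   #endif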

An alternative approach for implementing such abstract SIMD types is to have all operations load/store from a fixed-width buffer in the structure and rely on the compiler eliminating the redundant loads/stores. This avoids having to store the variable-length vector register directly in the SIMD class, and allows multiple implementations of the SIMD type for different VLEN. The generated code using these types would have to be runtime-dispatched based on the actual VLEN.
It could also be used to, e.g., create a SIMD type that can be stored in data structures and works for VLEN 128, 256, and 512, by making the buffer always 512 bits wide and simply not using the extra bits when VLEN<512. This isn't possible with the riscv_rvv_vector_bits attribute either, because it assumes a single fixed VLEN.
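
A minimal sketch of this buffer-based approach (illustrative, not the code from the linked example; the type and function names are made up):

   #include <riscv_vector.h>
   #include <stdint.h>

   typedef struct { uint32_t data[16]; } u32x16;   // 512-bit backing buffer

   static inline u32x16 u32x16_add(u32x16 a, u32x16 b) {
       size_t vl = __riscv_vsetvl_e32m1(16);       // min(16, VLMAX): 16 at VLEN=512, fewer below
       vuint32m1_t va = __riscv_vle32_v_u32m1(a.data, vl);
       vuint32m1_t vb = __riscv_vle32_v_u32m1(b.data, vl);
       u32x16 r;
       __riscv_vse32_v_u32m1(r.data, __riscv_vadd_vv_u32m1(va, vb, vl), vl);
       return r;   // the hope: chained operations let the compiler drop the store/reload pairs
   }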

This approach is, however, currently unusable, since no current compiler is capable of eliminating redundant predicated loads/stores: https://godbolt.org/z/TdajMTMKT
In the link above, the actual function shows what the codegen for such an implementation currently looks like, and the expected function simulates what the codegen should be with redundant load/store elimination.
As you can see, even when always using vl=VLMAX, no redundant loads/stores are removed.
Since the RVV compiler backends aren't as mature, I also compared how the compilers handle predicated (masked) vs. unpredicated AVX512 loads/stores.
There you can observe that redundant predicated AVX512 loads/stores also can't be eliminated, but unpredicated ones can.
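
For illustration, the AVX512 distinction boils down to something like the following sketch (not the code from the linked comparison):

   #include <immintrin.h>
   #include <stdint.h>

   // Unpredicated load: compilers readily forward/eliminate redundant ones.
   static inline __m512i load_unpredicated(const int32_t *buf) {
       return _mm512_loadu_si512(buf);
   }

   // Predicated (masked) load, even with an all-ones mask: typically left as-is.
   static inline __m512i load_predicated(const int32_t *buf) {
       return _mm512_maskz_loadu_epi32((__mmask16)0xFFFF, buf);
   }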

Hence, I suggest adding unpredicated RVV load/store intrinsics, i.e. the whole-register loads/stores, to help facilitate the compilers' load/store elimination for this use case.
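
A sketch of what such intrinsics could look like; the names and signatures below are hypothetical (modelled on the vl1re32.v/vs1r.v instructions) and are not part of the current spec:

   // Hypothetical whole-register load/store intrinsics (illustrative only):
   void copy_whole_register(uint32_t *dst, const uint32_t *src) {
       vuint32m1_t v = __riscv_vl1re32_v_u32m1(src);  // whole-register load: no vl, no mask
       __riscv_vs1r_v_u32m1(dst, v);                  // whole-register store
   }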

@camel-cdr
Contributor Author

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

@topperc
Collaborator

topperc commented Oct 29, 2024

I'd like to reiterate: there is currently no usable standard way to implement fixed-size SIMD abstractions. Even with current compiler extensions you can only target a single VLEN per translation unit, which makes them unusable for single-header libraries and requires extensive redesigning of existing library architectures.

The C code example might not be the most illustrative, so here is a C++ one: https://godbolt.org/z/TW794nxWT

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

@dzaima

dzaima commented Oct 29, 2024

Some LLVM IR surgery to get good codegen, even being VLEN-agnostic: https://godbolt.org/z/jh8YK394T https://godbolt.org/z/ePrGdzEKx

I'd imagine it wouldn't be hard for a compiler to recognize and convert __riscv_vle/__riscv_vse with a constant vl into native loads/stores, similarly to what is shown here.

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

@camel-cdr
Contributor Author

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data.
It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz
It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

@topperc
Collaborator

topperc commented Oct 29, 2024

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*XLEN bytes. Changing the type to uint32_t *data makes the struct smaller, so it fits in a GPR.
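
A small illustration of the ABI point (a sketch; per the RISC-V calling convention, aggregates larger than 2*XLEN bytes are passed by reference):

   #include <stdint.h>

   struct big   { uint32_t data[16]; };  // 64 bytes > 2*XLEN on RV64: passed indirectly (hidden copy)
   struct small { uint32_t *data; };     //  8 bytes <= 2*XLEN: passed directly in a GPR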

@kito-cheng
Collaborator

The core problem you're trying to resolve is using vector intrinsics to implement some fixed-length vector functionality. Honestly, this isn't the focus of the RVV intrinsic spec (at least in version 1.0). A straightforward approach is to use GNU vectors (e.g., typedef int int32x4_t __attribute__ ((vector_size (16)));), which are well supported by both compilers and generate good code. However, the issue is that not all operations can be expressed with the built-in C/C++ operators, so we eventually need to convert the GNU types to RVV types, which leads to several redundant load/store operations that are hard to eliminate.
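
A minimal sketch of that conversion problem (illustrative names; assumes VLEN >= 128 so the 128-bit GNU vector fits in one m1 register):

   #include <riscv_vector.h>
   #include <stdint.h>

   typedef uint32_t u32x4 __attribute__((vector_size(16)));   // GNU vector type

   static inline u32x4 mul_via_rvv(u32x4 a, u32x4 b) {
       // To use an RVV intrinsic on GNU vectors, current code round-trips through memory:
       vuint32m1_t va = __riscv_vle32_v_u32m1((const uint32_t *)&a, 4);
       vuint32m1_t vb = __riscv_vle32_v_u32m1((const uint32_t *)&b, 4);
       u32x4 r;
       __riscv_vse32_v_u32m1((uint32_t *)&r, __riscv_vmul_vv_u32m1(va, vb, 4), 4);
       return r;   // these are the redundant loads/stores that are hard to eliminate
   }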

This reflects a limitation in the current compiler implementation, as it doesn't handle memory analysis for scalable vector types very effectively.

Returning to the main problem we want to solve: creating an easier programming model for SIMD-style programs while also improving code generation. One idea I have is to provide an alternative set of intrinsics to improve both user experience and code generation quality. Here’s a concrete example:

   int32x4_t a, b, c;
   a = __riscv_vle32(int32x4_t, ptr_a, 4);
   // We could also provide an overloaded version for VLMAX, e.g., __riscv_vle32(int32x4_t, ptr_a);
   // Or simply use: a = *(int32x4_t *)ptr_a;
   // ----
   b = __riscv_vle32(int32x4_t, ptr_b, 4);
   c = __riscv_vadd(int32x4_t, a, b, 4);
   // Alternative syntax: c = a + b;
   // or c = __riscv_vadd(int32x4_t, a, b);
   // ----
   __riscv_vse32(int32x4_t, ptr_c, c);
   // Or: *(int32x4_t *)ptr_c = c;

This approach was discussed in the early stages of the RVV intrinsics, but it wasn’t prioritized, so it didn’t come to fruition.

Another idea I have is to try converting scalable vector types to fixed-length vectors, which might improve code generation quality. However, this would require significant engineering effort, so upstream toolchain compilers may not consider it unless there's strong motivation.

In conclusion, I would say that introducing intrinsics for whole-register vector load/store doesn’t truly solve the issue...the real problem lies in the compiler implementation.

@kito-cheng
Collaborator

I tried to emulate it by casting the pointer, but the code is still not great. https://godbolt.org/z/cx43oETh4

I think the problem here is that LLVM is too clever and generates regular vector loads, because it knows the size of data. It can eliminate the redundant loads/stores if I change uint32_t data[16] to uint32_t *data to coax it into generating whole-register loads/stores in some prior lowering/optimization pass: https://godbolt.org/z/aqf8b8KGz It doesn't seem to be able to propagate this to subsequent function calls, though (see foo()), and GCC is still struggling.

The regular vector loads were generated because the struct needs to be copied indirectly for ABI reasons, since it exceeds 2*XLEN bytes. Changing the type to uint32_t *data makes the struct smaller, so it fits in a GPR.

FYI: This ABI proposal is trying to resolve this issue:

riscv-non-isa/riscv-elf-psabi-doc#418

@topperc
Collaborator

topperc commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?
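
A sketch of what such builtins might look like from the user's side; the names below are hypothetical and exist in no current compiler, and the example assumes Zvl256b so the 256-bit GNU vector is no larger than an m1 register:

   typedef uint32_t u32x8 __attribute__((vector_size(32)));   // 256-bit GNU vector type

   // Hypothetical conversion builtins (illustrative only), valid because
   // sizeof(u32x8) <= VLEN/8 * LMUL under Zvl256b with LMUL=1:
   vuint32m1_t to_rvv(u32x8 x)   { return __builtin_rvv_from_fixed_u32m1(x); }
   u32x8 from_rvv(vuint32m1_t v) { return __builtin_rvv_to_fixed_u32x8(v); }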

@dzaima

dzaima commented Nov 8, 2024

What if we added builtins to convert between GNU vector_size types and RVV vector types when the GNU vector type was known to be no larger than the RVV type based on LMUL and Zvl*b?

That's what I noted as an option here before:

Additionally, I think something like https://godbolt.org/z/Mo184bTxT might be nice to allow, but currently isn't.

I don't think there necessarily needs to be a restriction on the relative size. Without the restriction, you could have a 32B or 64B buffer and write generic code over VLEN=128/256/512 (VLEN≥1024 would still work, but only use the low 512 bits), allowing good VLEN=256/512 performance while still being compatible with VLEN=128. Also noted as an idea here.
