
Any plans for RISC-V Vector Extension (RVV) optimization? #11063

Closed
joy2myself opened this issue Oct 21, 2024 · 8 comments

Comments

@joy2myself

Feature description

First off, thanks for all the amazing work on GDAL! I wanted to ask if there are any plans to optimize GDAL for the RISC-V platform, specifically using the RISC-V Vector Extension (RVV). With RISC-V gaining popularity, having RVV optimizations could potentially bring performance benefits to GDAL on that platform.

If there’s no plan yet, would this be something you’d consider? My team and I would be interested in contributing if there’s a need for testing or development in this area.

Thanks!

Additional context

No response

@rouault
Member

rouault commented Oct 22, 2024

Hi, thanks for your interest. May I ask what your interest in GDAL and/or RISC-V is? Perhaps you're affiliated with a RISC-V vendor or some group that promotes its adoption?
I ran an informal poll on my Mastodon account at https://mastodon.social/@EvenRouault/113344940167220826. 22 people responded: 0% use RISC-V currently, 9% might, and 91% presumably never will.
I would be really reluctant to have RISC-V specific code paths in our code base:

  • it would be really hard for GDAL maintainers to ensure they are correct, both in the initial implementation and, more critically, over the next 5, 10 or 15 years, due to the absence of access to that hardware, either locally or on continuous integration platforms such as those provided by GitHub workflows
  • platform-specific code paths that rot could mean code that no longer compiles, or falls behind bugfixes: a much worse situation than the current one, where, as long as C/C++ compilers for RISC-V work correctly, users at least get correct results
  • there's the issue of how to make them evolve when the base portable code path evolves, which happens regularly, as in Warper: fix shifted/aliased result with Lanczos resampling when XSCALE < 1 or YSCALE < 1 #11046

I would be much more supportive of RISC-V optimizations going through a software abstraction layer. I see that libjxl uses https://github.com/google/highway and that it has RISC-V support. That would also enable us to cover other platforms like NEON / ARM64.
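For illustration, a minimal sketch of what a Highway-based kernel could look like (the function and its name are hypothetical, not existing GDAL code; this follows Highway's documented static-dispatch style). The same source compiles to SSE2/AVX2, NEON or RVV depending on the compilation target:

```cpp
// Hypothetical sketch, not GDAL code: a portable kernel written against
// Highway's static-dispatch API. Lanes(d) adapts to the target's vector
// width, so the loop is vector-length agnostic.
#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void AddArrays(const float* HWY_RESTRICT a, const float* HWY_RESTRICT b,
               float* HWY_RESTRICT out, size_t n) {
  const hn::ScalableTag<float> d;  // lane count chosen by the target
  const size_t N = hn::Lanes(d);
  size_t i = 0;
  for (; i + N <= n; i += N) {
    hn::Store(hn::Add(hn::Load(d, a + i), hn::Load(d, b + i)), d, out + i);
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}
```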

Currently we have a few specific SSE/SSE2/AVX2 code paths using Intel intrinsics, either directly or through a thin abstraction layer such as gcore/gdalsse_priv.h. I'm undecided whether adopting highway would totally deprecate those code paths, or whether we would keep them. It all depends on whether we can reach the same level of performance, and also on how we deal with the external dependency.

The main candidates for accelerated code paths are alg/gdalwarpkernel.cpp, gcore/overview.cpp and the CopyWord-related code of gcore/rasterio.cpp.

@joy2myself
Author

Hi @rouault,

Thank you for the detailed response! Let me introduce myself first—I’m Yin Zhang (张尹), from the Programming Language and Compilation Technology Lab (PLCT Lab) at the Intelligent Software Research Center, Institute of Software, Chinese Academy of Sciences. We are members of the RISC-V Foundation and actively involved in promoting its development. Additionally, we have some non-public projects that would benefit from using GDAL on the RISC-V platform, where performance is a key concern.

I personally have experience in various SIMD and vector-related optimizations, including RISC-V vector optimizations for OpenCV (https://github.com/opencv/opencv/commits/4.x/?author=joy2myself). I’m also working on the implementation of the <experimental/simd> header for the libc++ standard library (https://github.com/llvm/llvm-project/commits/main/?author=joy2myself). I fully understand your caution regarding platform-specific code. If adding RISC-V specific code paths is not desirable for the GDAL upstream, we may consider maintaining a downstream fork to suit our project needs.

Alternatively, we could discuss potential frameworks for upstream optimizations in GDAL. Based on my experience in the SIMD field, I see three primary approaches for SIMD optimizations in most foundational libraries:

  1. Platform-specific code: This involves using native intrinsics or embedded assembly for each SIMD instruction set. While this approach offers the best performance, it lacks portability, requiring multiple versions of the optimized code for different platforms.
  2. Unified abstraction layers: Libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions. These layers are portable across platforms and easy to use. However, this approach often requires sacrificing certain platform-specific features to ensure a unified and generic abstraction interface. As a result, it is generally not possible to achieve the highest performance in all use cases and across all target platforms. They may also introduce external dependencies.
  3. Custom hardware acceleration layer: Similar to OpenCV’s universal intrinsics, this approach involves designing a custom abstraction layer for the specific algorithms in the library, and then providing platform-specific implementations of that layer for each target individually (a minimal sketch follows this list). This offers both portability and high performance, but it requires significant resources to develop and maintain the custom abstraction layer. Additionally, such a layer may be tailored to the needs of a specific library and might not be as generic as other SIMD abstraction solutions.
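To make approach 3 concrete, here is a minimal, purely illustrative sketch of such a custom layer (all names are hypothetical; GDAL's gcore/gdalsse_priv.h and OpenCV's universal intrinsics are real-world instances of this idea):

```cpp
// Illustrative sketch of approach 3: one generic wrapper type with
// per-platform implementations selected at compile time.
#if defined(__SSE2__)
#include <emmintrin.h>
struct v_f32x4 {
  __m128 val;
  static v_f32x4 load(const float* p) { return {_mm_loadu_ps(p)}; }
  void store(float* p) const { _mm_storeu_ps(p, val); }
  friend v_f32x4 operator+(v_f32x4 a, v_f32x4 b) {
    return {_mm_add_ps(a.val, b.val)};
  }
};
#else
struct v_f32x4 {  // portable scalar fallback
  float val[4];
  static v_f32x4 load(const float* p) {
    v_f32x4 r;
    for (int i = 0; i < 4; ++i) r.val[i] = p[i];
    return r;
  }
  void store(float* p) const {
    for (int i = 0; i < 4; ++i) p[i] = val[i];
  }
  friend v_f32x4 operator+(v_f32x4 a, v_f32x4 b) {
    v_f32x4 r;
    for (int i = 0; i < 4; ++i) r.val[i] = a.val[i] + b.val[i];
    return r;
  }
};
#endif
```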

Each approach has its pros and cons, and the choice often depends on the specific needs and practical circumstances of the project. Of course, you are far more familiar with the specific requirements and real-world conditions of GDAL than I am.

Looking forward to hearing your thoughts!

Best,
Yin Zhang

@rouault
Member

rouault commented Oct 24, 2024

2. Unified abstraction layers: Libraries like Google Highway or the C++ <experimental/simd> header provide unified SIMD abstractions.

My own inclination would go that way. Which approach is the preferred one remains to be determined. Is experimental/simd a sort of staging area for evolutions of the C++ standard library? What is its status? The GDAL project is rather conservative, and I don't think we would want to adopt a C++ feature that hasn't been officially adopted and doesn't have at least one implementation. Perhaps the topic is not yet mature enough to be considered for GDAL.

Platform-specific code would fall for me in the https://gdal.org/en/latest/development/rfc/rfc85_policy_code_additions.html category. The GDAL project has unfortunately seen a lot of contributors over time "dump" their code upstream and run away afterwards, leading to even more work for maintainers.

Any choice should probably go through the RFC route: https://gdal.org/en/latest/development/rfc/index.html

Custom hardware acceleration layer

I had initiated a very primitive version of that with gcore/gdalsse_priv.h, but it is more a convenient way of using SSE intrinsics with C++ than an intended cross-architecture abstraction layer. Other libraries such as Highway, xsimd, etc. have likely done a much better job at this.

@joy2myself
Author

Hi @rouault,

Regarding the status of <experimental/simd>: yes, I think it can be understood as a staging area for evolutions of the C++ standard. Indeed, as the name suggests, <experimental/simd> currently lives in the experimental namespace, reflecting its development stage. Once it matures, it will likely be moved to the std::simd namespace for standardized usage. At present, there is a usable implementation of <experimental/simd> in the libstdc++ library of the GCC compiler (from GCC 11.2 onwards, you can directly include the header and use it), and I am currently working on another implementation in LLVM/Clang's libc++ library.
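As a quick illustration, a minimal sketch of <experimental/simd> usage with the libstdc++ implementation just mentioned (the function is hypothetical, not GDAL code; native_simd picks the widest SIMD type for the compilation target):

```cpp
// Hedged sketch of <experimental/simd> (libstdc++, GCC 11.2+).
#include <experimental/simd>
#include <cstddef>

namespace stdx = std::experimental;

void Scale(float* data, std::size_t n, float factor) {
  using V = stdx::native_simd<float>;
  std::size_t i = 0;
  for (; i + V::size() <= n; i += V::size()) {
    V v(&data[i], stdx::element_aligned);        // vector load
    v *= factor;
    v.copy_to(&data[i], stdx::element_aligned);  // vector store
  }
  for (; i < n; ++i) data[i] *= factor;  // scalar tail
}
```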

I fully understand the upstream position regarding platform-specific code. After internal discussions with my team, we will carefully evaluate and determine our plan. There seem to be two possible directions at this point:

  • One option is to leverage highway for optimizations. In this case, we could submit an RFC to the upstream community and push forward with the optimization implementation, while also using the optimized version to meet our project needs.
  • Another option would be to maintain a downstream RISC-V specific optimized version ourselves to fulfill our project requirements, temporarily shelving any plans to submit such code upstream.

Thank you again for your detailed and thoughtful response. It has been very helpful in shaping our direction.

@rouault
Member

rouault commented Nov 5, 2024

FYI, in #11202 , I've used the sse2neon.h header that works very well. Not sure if there's a similar sse2rvv.h ;-)

@camel-cdr

camel-cdr commented Nov 5, 2024

FYI, in #11202 , I've used the sse2neon.h header that works very well. Not sure if there's a similar sse2rvv.h ;-)

There is: sse2rvv and neon2rvv

But I wouldn't recommend using them for more than a quick initial port, because they don't allow you to take advantage of the full vector length. You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic generic SIMD.

From what I've seen of the codebase, I would recommend successively adding custom RVV code paths, because the SIMD usage seems to be mostly in isolated places.
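For illustration (a hedged sketch, not GDAL code), such a custom RVV code path written with the RVV 1.0 intrinsics would look like this; vsetvl lets the loop adapt to whatever vector length the hardware provides:

```cpp
// Hypothetical sketch of a vector-length-agnostic RVV 1.0 code path.
#include <riscv_vector.h>
#include <stddef.h>

void scale_f32_rvv(float* data, size_t n, float factor) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e32m8(n);               // elements this pass
    vfloat32m8_t v = __riscv_vle32_v_f32m8(data, vl);  // load
    v = __riscv_vfmul_vf_f32m8(v, factor, vl);         // multiply by scalar
    __riscv_vse32_v_f32m8(data, v, vl);                // store
    data += vl;
    n -= vl;
  }
}
```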

due to the absence of access to that hardware, either locally or on continuous integration platforms such as those provided by GitHub workflows

Some RVV 1.0 hardware is already available, see "Processors with RVV 1.0": https://camel-cdr.github.io/rvv-bench-results/index.html

You can just use QEMU in the GitHub CI. That's even better than real hardware, because you can configure it to use different vector lengths and adjust some other implementation details.

or falls behind bugfixes

Yeah, that could happen if you don't have the capacity to maintain it. Hopefully problems would get caught if tests are run in CI.

See the RVV support that is now in gnuradio/volk for an example CI setup.

@rouault
Member

rouault commented Nov 5, 2024

You'd be better off using something like highway or potentially std::simd, which allow you to write vector-length-agnostic generic SIMD.

I don't know RVV specifics, but for Intel, for SSE2 vs AVX2, in the few times I've compared in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt, I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".

From what I've seen of the codebase, I would recommend successively adding custom RVV codepaths, because the SIMD usage seems to be mostly in isolated places.

Did you identify specific places where that would be beneficial? The ratio of measured runtime speed improvement to implementation & maintenance cost would have to be assessed case by case.

@camel-cdr

I don't know RVV specifics, but for Intel, for SSE2 vs AVX2, in the few times I've compared in GDAL, the AVX2 boost is far from being twice the SSE2 one. For example, in gcore/statistics.txt, I mention that the boost of AVX2 vs SSE2 is just 15%. But yes, if you have some abstraction of the vector length, you can get that "for free".

The difference for RVV should be larger, because x86 CPUs still try to keep SSE fast due to legacy code, while RVV implementations tend not to specifically optimize for operating below their full vector length.

Did you identify specific places where that would be beneficial? The ratio of measured runtime speed improvement to implementation & maintenance cost would have to be assessed case by case.

No, I didn't, because I didn't know about this project before I found this issue. I just wanted to suggest how I'd approach adding RVV optimizations.
