Optimize op_conv_vef_face kernel #12
base: next
Thanks Paul, I will have a look on Monday.
For the Op_Conv_VEF_Face kernel, we measure a speedup of between 18% and 34% (Nvidia A6000) on our GPU test cases. I merged your code into a local branch here because the pattern is very interesting and because, thanks to your work, we now know that a local static array does not live in registers but in global memory, which is replaced here by faster scratch memory. The code can switch between the two implementations (with and without scratch memory) through a TRUST_USE_SCRATCH_MEMORY environment variable, for testing.
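A minimal sketch of such a runtime switch is given here for discussion. It assumes the variable is read with std::getenv and that any non-empty value other than "0" enables the scratch path; the actual parsing used in the branch is not shown in this thread, and the function names below are hypothetical placeholders.

```cpp
#include <cstdlib>
#include <string>

// Assumption: unset, empty, or "0" selects the original implementation;
// any other value selects the scratch-memory implementation.
bool use_scratch_memory() {
  const char* value = std::getenv("TRUST_USE_SCRATCH_MEMORY");
  if (value == nullptr) return false;
  const std::string s(value);
  return !s.empty() && s != "0";
}

// Hypothetical placeholders for the two kernel variants.
std::string run_with_scratch() { return "scratch"; }
std::string run_without_scratch() { return "local arrays"; }

// Picks one variant at run time so both can be timed from the same binary.
std::string run_op_conv_kernel() {
  return use_scratch_memory() ? run_with_scratch() : run_without_scratch();
}
```

Under these assumptions, running the same test case once with TRUST_USE_SCRATCH_MEMORY=1 and once with the variable unset would compare the two code paths without rebuilding.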
I am adding Adrien and Rémi to discuss the benefit/complexity ratio introduced by using scratch memory. To give an idea, a 30% speedup is the probable gain from using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU-specific? Is it the same on AMD, and what will it be in 10 years on future GPU cards? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value. According to Hari's tests, scratch memory through hierarchical memory is not worthwhile on other kernels (like the diffusion one).
On the AMD MI250X, the slowdown with scratch memory is between 7% and 25% (warp size 64?).
Remove optimization of scratch memory size for order 3.
This PR aims to optimize the large convective kernel in src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp. It replaces temporary arrays in local memory with Kokkos views in scratch memory.