Optimize op_conv_vef_face kernel #12

Draft: wants to merge 3 commits into base: next

Conversation

pzehner commented Mar 28, 2025

This PR aims to optimize the large convective kernel in src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp. It replaces temporary arrays in local memory with Kokkos views in scratch memory.

pledac (Member) commented Mar 28, 2025

Thanks Paul, I will have a look on Monday.

pledac (Member) commented Mar 28, 2025

For the Op_Conv_VEF_Face kernel, we measure a speedup between 18% and 34% on an Nvidia A6000, depending on the GPU test case.
On H100, the speedup drops to between 6% and 14%.

I merged your code into a local branch here because the pattern is very interesting: thanks to your work, we now know that the local static arrays are not held in registers but in global memory, and that they are replaced here by faster scratch memory. The code can switch between the two implementations (with and without scratch memory) through a TRUST_USE_SCRATCH_MEMORY environment variable, for testing.
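As an illustration of such a runtime switch, here is a minimal sketch in plain C++. Only the variable name TRUST_USE_SCRATCH_MEMORY comes from the comment above. The helper names, the enum, and the rule that an unset, empty, or "0" value means "off" are assumptions, not the actual TRUST implementation.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when the variable is set to something
// other than the empty string or "0" (an assumed convention).
bool env_flag_enabled(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Hypothetical tags for the two code paths being compared.
enum class ConvKernel { LocalArrays, ScratchMemory };

// Picks the implementation to run from the environment.
ConvKernel select_conv_kernel() {
  return env_flag_enabled("TRUST_USE_SCRATCH_MEMORY")
             ? ConvKernel::ScratchMemory
             : ConvKernel::LocalArrays;
}
```

With this convention, running a test case once with `TRUST_USE_SCRATCH_MEMORY=1` and once without it would time both variants from the same binary.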

pledac (Member) commented Mar 28, 2025

I am adding Adrien and Rémi to discuss the benefit/complexity ratio of using scratch memory. To give an idea, a 30% speedup is about what we would probably gain by using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU-specific? Is it the same on AMD? And what about future GPUs in 10 years? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value.

According to Hari's tests, scratch memory through hierarchical memory does not pay off on other kernels (such as the diffusion one).

pledac self-assigned this Mar 28, 2025
pledac (Member) commented Mar 29, 2025

On the AMD MI250X, scratch memory causes a slowdown between 7% and 25% (warp size 64?).

Remove optimization of scratch memory size for order 3.
2 participants