Optimize `op_conv_vef_face` kernel #12

pzehner · 2025-03-28T11:19:17Z

This PR aims to optimize the large convective kernel in `src/VEF/Operateurs/Op_Conv/Op_Conv_VEF_Face.cpp`. It replaces temporary arrays in local memory with Kokkos views in scratch memory.

pledac · 2025-03-28T16:56:41Z

Thanks Paul, I will have a look on Monday.

pledac · 2025-03-28T17:27:55Z

For the Op_Conv_VEF_Face kernel, we measure a speedup between 18% and 34% on an Nvidia A6000, according to our GPU test cases.
On H100, the speedup drops to between 6% and 14%.

And strangely, it seems slower on A100...

I merged your code into a local branch here because the pattern is very interesting and because, thanks to your work, we now know that a local static array does not use registers but global memory, which is replaced here by faster scratch memory. The code can switch between the two implementations (with and without scratch memory) through a TRUST_USE_SCRATCH_MEMORY environment variable, for testing.
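As an illustration of such a switch, here is a minimal sketch, assuming the variable is read as a simple on/off flag. Only the name TRUST_USE_SCRATCH_MEMORY comes from this thread; `use_scratch_memory`, `ConvImpl` and `select_conv_impl` are hypothetical names, and the actual parsing in TRUST may differ.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from the TRUST sources): returns true when
// TRUST_USE_SCRATCH_MEMORY is set to anything other than "" or "0".
inline bool use_scratch_memory()
{
  const char* value = std::getenv("TRUST_USE_SCRATCH_MEMORY");
  if (value == nullptr)
    return false; // variable not set: keep the original implementation
  const std::string s(value);
  return !s.empty() && s != "0";
}

// Hypothetical tag for the two code paths discussed in this PR.
enum class ConvImpl { LocalArrays, ScratchMemory };

// Picks the implementation to run, based on the environment.
inline ConvImpl select_conv_impl()
{
  return use_scratch_memory() ? ConvImpl::ScratchMemory : ConvImpl::LocalArrays;
}
```

With this shape, the same binary can be benchmarked both ways, for example by running a test case once with `TRUST_USE_SCRATCH_MEMORY=1` and once without it.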

pledac · 2025-03-28T17:33:56Z

I am adding Adrien and Rémi to discuss the benefit/complexity ratio introduced by using scratch memory. To give an idea, a 30% speedup is the probable gain from using the right layout on this kernel. What bothers me, for example, is the warp size, set here to 32. Is this value GPU-specific? Is it the same on AMD? And what about future GPU cards in 10 years? Kokkos provides performance portability and, in my limited understanding, the developer should not have to care about this value.
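To make the concern concrete, here is a minimal sketch of how the per-team scratch requirement scales when the team size is a parameter instead of the literal 32. It assumes every thread of a team owns a fixed number of `double` temporaries; `team_scratch_bytes` and the count of 12 values per thread used in the note below are hypothetical and not taken from the kernel.

```cpp
#include <cstddef>

// Hypothetical helper: bytes of per-team scratch needed when each thread of
// the team stores `values_per_thread` temporaries of type double.
constexpr std::size_t team_scratch_bytes(std::size_t team_size,
                                         std::size_t values_per_thread)
{
  return team_size * values_per_thread * sizeof(double);
}

// Doubling the team size doubles the scratch needed per team.
static_assert(team_scratch_bytes(64, 12) == 2 * team_scratch_bytes(32, 12),
              "scratch size scales linearly with the team size");
```

Under these assumptions, a team of 64 threads needs twice the scratch of a team of 32, so any hard-coded size has to be revisited whenever the team size changes.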

According to Hari's tests, scratch memory through hierarchical memory is not worthwhile on other kernels (like the diffusion one).

pledac · 2025-03-29T12:29:52Z

On AMD MI250X, the slowdown with scratch memory is between 7% and 25% (warp size 64?).

pledac requested review from abruneton and rbourgeois33 March 28, 2025 17:28

pledac self-assigned this Mar 28, 2025

pzehner force-pushed the next-scratch branch from 25fbf73 to 447c3a4 Compare March 31, 2025 13:26

pzehner added 2 commits April 3, 2025 10:26

Use scratch memory on op_conv_vef_face kernel

9a90060

Decrease shared memory usage

ed79488

pzehner force-pushed the next-scratch branch from 447c3a4 to ed79488 Compare April 3, 2025 08:26

Only allocate scratch memory once

d2029bc

Remove optimization of scratch memory size for order 3.
