cc: @qiao-bo @turbo0628
Hi, thanks for raising this interesting discussion. I don't think it is a best practice to manually put each `for` loop into its own `ti.kernel`. I locally tested some simple kernels (along the lines of the sketch below), and they have the same performance and generate almost identical PTX code. I think this kind of performance difference needs to be analyzed case by case. In your previous p2g example, the changes involve more than just splitting into more kernels, so it might need some further investigation.
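The test kernels from this reply were not captured in the page, so as a stand-in here is a minimal sketch of that kind of comparison; the field names, sizes, and loop bodies are my own illustration. As I understand Taichi's compilation model, each top-level loop in a `ti.kernel` is offloaded as its own GPU kernel either way, which is why simple cases like this produce nearly identical PTX:

```python
import taichi as ti

ti.init(arch=ti.gpu)

n = 1024 * 1024
x = ti.field(ti.f32, shape=n)
y = ti.field(ti.f32, shape=n)

@ti.kernel
def fused():
    # Two top-level for loops in a single ti.kernel.
    # Taichi offloads each top-level loop as a separate task.
    for i in x:
        x[i] = i * 0.5
    for i in y:
        y[i] = x[i] + 1.0

@ti.kernel
def first_half():
    for i in x:
        x[i] = i * 0.5

@ti.kernel
def second_half():
    for i in y:
        y[i] = x[i] + 1.0

# Compare: one call vs. two calls doing the same total work.
fused()
first_half()
second_half()
```

If memory serves, `ti.init(arch=ti.cuda, print_kernel_nvptx=True)` dumps the generated PTX so the two variants can be compared directly, but check that option name against your Taichi version.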
In this other discussion, I had made some changes to one of the top-level `for` loops within a `ti.kernel` that I had expected would cause a big improvement in performance (using gathered reads instead of scattered atomic writes; a sketch of that kind of restructuring is at the end of this post). However, this change actually caused a huge degradation in performance that I could not understand. After being stuck on it for many weeks, in the process of debugging and profiling I tried splitting that `ti.kernel` into two kernels, one for each of the top-level `for` loops in the original kernel. This provided the performance boost that I had originally expected to see, but it also greatly surprised me based on how I thought the Taichi compiler works, and it makes me wonder what the best practice is regarding how many top-level `for` loops a `ti.kernel` should contain.

For some reason I had thought that each `for` loop in a `ti.kernel` should map to one or more true GPU kernels, and that each `for` loop in a `ti.kernel` should therefore compile essentially independently -- i.e., that as far as the Taichi compiler is concerned, it should make no difference whether you put all of your `for` loops in one huge `ti.kernel` or put each `for` loop into its own `ti.kernel`. This is apparently not correct in the case I encountered, so I wonder:
(1) Is it possible that I somehow stumbled onto an edge-case bug in the Taichi compiler, or is it expected that splitting top-level `for` loops into separate `ti.kernel`s can have this kind of impact on performance?

(2) If this is expected, I also wonder: should it be considered a best practice to manually put each `for` loop into its own `ti.kernel`? Does doing that minimize the size of the code that the Taichi compiler has to analyze and optimize when compiling the `ti.kernel`, and is it therefore always better? (I noticed that most of the examples I've looked at in the Taichi repo, as well as the Taichi-Elements project, put several top-level `for` loops within a `ti.kernel`, so it doesn't appear that the experts on the Taichi team split up kernels this way.)

(3) Are there any times when it is definitely better for performance to keep multiple top-level `for` loops in a single `ti.kernel`?

Thank you very much for your insights and suggestions on this topic!
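For readers without the linked discussion: the actual change is not reproduced here, but below is a minimal, purely illustrative sketch of the general scatter-vs-gather pattern. The names, sizes, and mass-only accumulation are my own, and the gather version is deliberately naive to show the access pattern, not a real optimization:

```python
import taichi as ti

ti.init(arch=ti.gpu)

n_particles = 8192
n_grid = 128
px = ti.Vector.field(2, ti.f32, shape=n_particles)  # particle positions in [0, 1)
grid_m = ti.field(ti.f32, shape=(n_grid, n_grid))   # accumulated grid mass

@ti.kernel
def p2g_scatter():
    # Scatter: one thread per particle, atomic writes into the grid.
    for p in px:
        cell = ti.cast(px[p] * n_grid, ti.i32)
        ti.atomic_add(grid_m[cell], 1.0)

@ti.kernel
def p2g_gather():
    # Gather: one thread per grid cell, plain (non-atomic) reads of
    # every particle. Deliberately naive O(cells * particles); it only
    # illustrates replacing scattered atomic writes with gathered reads.
    for i, j in grid_m:
        m = 0.0
        for p in range(n_particles):
            cell = ti.cast(px[p] * n_grid, ti.i32)
            if cell[0] == i and cell[1] == j:
                m += 1.0
        grid_m[i, j] = m
```

Whether the gather version wins depends heavily on how the reads are organized; as this thread shows, the surrounding kernel structure can matter as much as the access pattern itself.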