[QST] How to Let __launch_bounds__ and setmaxnreg Work with Each Other? #2007
Background
In a dense fp16 GEMM on H800, I have a tile size of 192x128 and 3 warpgroups, with WG1 and WG2 as the cooperative consumer warpgroups. In detail, the producer warpgroup uses cutlass::arch::warpgroup_reg_dealloc<24>(); while the consumers use cutlass::arch::warpgroup_reg_alloc<232>(); to set the warpgroup-level register count hint. During compilation, the compiler shows ptxas info : Used 122 registers, and the kernel runs well.
What is your question?
Based on that, I added a __launch_bounds__(384, 1) hint to the kernel. The compiler now shows ptxas info : Used 168 registers, which is kind of expected. However, after launch, the kernel hangs at cutlass::arch::warpgroup_reg_alloc<232>(); some warpgroups cannot proceed, and the wgmma cannot be issued.
Another observation: when I change the consumer register count to cutlass::arch::warpgroup_reg_alloc<168>(); the kernel runs well, but if I increase this value, the kernel hangs.
The strange thing is that FA3 https://github.com/Dao-AILab/flash-attention/blob/0dfb28174333d9eefb7c1dd4292690a8458d1e89/hopper/flash_fwd_kernel.h#L28 also uses this method.
How to understand such behavior? Can we dump more info during compilation?
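For reference, here is a rough sketch of the register arithmetic behind the numbers above. It is only bookkeeping, not a diagnosis of the hang. It assumes 65536 32-bit registers per SM, 128 threads per warpgroup, and a simplified model in which the per-CTA register pool equals the launch-time per-thread count times the CTA size (allocation granularity is ignored). The helper names are hypothetical and are not part of CUTLASS.

```cpp
// Assumptions (not from the compiler output): register file size per SM,
// warpgroup size, and a pool model that ignores allocation granularity.
constexpr int kRegsPerSM = 65536;
constexpr int kThreadsPerWarpgroup = 128;

// Registers reserved for the CTA at launch, given the ptxas per-thread count.
constexpr int pool_at_launch(int regs_per_thread, int num_warpgroups) {
    return regs_per_thread * kThreadsPerWarpgroup * num_warpgroups;
}

// Registers the CTA would hold once one producer warpgroup has shrunk to
// producer_regs and each consumer warpgroup has grown to consumer_regs.
constexpr int demand_after_setmaxnreg(int producer_regs, int consumer_regs,
                                      int num_consumer_wgs) {
    return (producer_regs + consumer_regs * num_consumer_wgs) * kThreadsPerWarpgroup;
}

// With __launch_bounds__(384, 1): 168 registers x 384 threads.
static_assert(pool_at_launch(168, 3) == 64512, "launch pool");
static_assert(pool_at_launch(168, 3) <= kRegsPerSM, "fits in one SM");

// Producer at 24, two consumers at 232.
static_assert(demand_after_setmaxnreg(24, 232, 2) == 62464, "demand");
static_assert(demand_after_setmaxnreg(24, 232, 2) <= pool_at_launch(168, 3),
              "demand is within the launch pool under this model");
```

Under this simplified model, the 24/232 configuration (62464 registers) fits within the 168 x 384 = 64512 pool, so the raw budget alone does not seem to explain why only alloc<168> works.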