Replies: 1 comment
-
Sorry, I missed the warpgroup_wait<0>() call hidden inside accumulation.scale_if_needed.
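For anyone else tripping over the same thing, here is a minimal host-side sketch of that ordering. It is plain C++, not CUTLASS code: ToyAccumulation, issue_mma and wait_all are hypothetical names, and the deferred callback only stands in for an MMA batch that has been issued but has not yet landed in the accumulator registers. The single point it illustrates is that the scaling step waits first and only then reads the temporary accumulator.

```cpp
#include <array>
#include <cstddef>
#include <functional>

// Hypothetical toy model, not the CUTLASS accumulation class.
struct ToyAccumulation {
    std::array<float, 4> temp{};      // temporary accumulator the "MMA" writes into
    std::array<float, 4> promoted{};  // scaled running sum kept in a second copy
    std::function<void()> pending;    // work that was issued but has not landed yet

    // Issue one batch; temp is NOT updated until wait_all() runs.
    // (This toy tracks a single outstanding batch at a time.)
    void issue_mma(float a, float b) {
        pending = [this, a, b] {
            for (float& x : temp) x += a * b;
        };
    }

    // Plays the role of warpgroup_wait<0>(): drain all outstanding work.
    void wait_all() {
        if (pending) {
            pending();
            pending = nullptr;
        }
    }

    // The wait is hidden inside the scaling step, so callers never read stale data.
    void scale_if_needed(float scale) {
        wait_all();
        for (std::size_t i = 0; i < temp.size(); ++i) {
            promoted[i] += temp[i] * scale;  // scale and promote
            temp[i] = 0.0f;                  // reset for the next batch (a choice of this toy)
        }
    }
};
```

In this toy, reading temp right after issue_mma still shows the old value; it only becomes valid once scale_if_needed has run its internal wait, which mirrors why the scaling code in the kernel is safe.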
-
I am looking at the blockwise-scaling kernel that applies the scales to the WGMMA accumulators:
https://github.com/NVIDIA/cutlass/blob/ca4fdbea708ad940c905359788372b8add9f85e0/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp#L657C1-L681C1
It looks like a batch of cute::gemm() (async GEMM) instructions is issued at line 694, and the scales are then applied at lines 700-711. This puzzles me: as I understand it, the async GEMM that was just issued might still be in flight at this point, so the accumulator registers may still hold the old values. Why is it safe to read these accumulators, multiply them by a scale, and assign (promote) them to another copy?