cute::warpgroup_wait<0> not needed between cute::gemm() and scaling accumulators in sm90 gemm kernel? #2190

chenfucn · 2025-03-23T04:51:15Z

I'm looking at the blockwise-scaling kernel that applies the scales to the WGMMA accumulators:

https://github.com/NVIDIA/cutlass/blob/ca4fdbea708ad940c905359788372b8add9f85e0/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp#L657C1-L681C1

      warpgroup_arrive();
      // Unroll the K mode manually to set scale D to 1
      CUTLASS_PRAGMA_UNROLL
      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
        // (V,M,K) x (V,N,K) => (V,M,N)
        cute::gemm(tiled_mma, tCrA(_,_,k_block,read_stage), tCrB(_,_,k_block,read_stage), accumulation());
        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
      }
      warpgroup_commit_batch();

      // Block scale the accumulators with reg tensor `tCrScaleAViewAsC` and `tCrScaleBViewAsC`
      if constexpr (ScaleMsPerTile == 1 && ScaleNsPerTile == 1) {
        ElementBlockScale scale_ab = tCrScaleAViewAsC.data()[0];
        accumulation.scale_if_needed(scale_ab);
      }
      if constexpr (ScaleMsPerTile  > 1 && ScaleNsPerTile == 1) {
        accumulation.scale_if_needed(tCrScaleAViewAsC);
      }
      if constexpr (ScaleMsPerTile == 1 && ScaleNsPerTile  > 1) {
        accumulation.scale_if_needed(tCrScaleBViewAsC);
      }
      if constexpr (ScaleMsPerTile  > 1 && ScaleNsPerTile  > 1) {
        accumulation.scale_if_needed(tCrScaleAViewAsC, tCrScaleBViewAsC);
      }

It looks like a batch of asynchronous cute::gemm() calls is issued at line 694, and the scales are then applied at lines 700-711. This puzzles me: as I understand it, the just-issued async GEMMs may still be in flight at that point, so the accumulator registers may still hold old values. Why is it safe to read these accumulators, multiply them by a scale, and promote them into another copy?

chenfucn · 2025-03-26T16:23:57Z

Sorry, I missed that warpgroup_wait<0>() is hidden inside accumulation.scale_if_needed.
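
For anyone else who trips over this, here is a minimal host-side sketch of the pattern, written as plain C++ rather than CUTLASS code. Every name in it (`ToyAccumulation`, `issue_mma`, `wait_all`, the `partial`/`promoted` buffers) is hypothetical and only illustrates the idea. Issued work is queued and does not land until a wait runs, and the scaling method performs that wait as its first step, so the caller needs no explicit wait between issuing and scaling.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy host-side model, NOT CUTLASS code: every name below is hypothetical.
// "Issued" MMAs are queued and only take effect when wait_all() runs,
// mimicking the worst case where async work has not landed yet.
struct ToyAccumulation {
  std::vector<float> partial;                 // written by the deferred "MMA"
  std::vector<float> promoted;                // receives the scaled results
  std::vector<std::function<void()>> queued;  // outstanding async work

  explicit ToyAccumulation(std::size_t n) : partial(n, 0.0f), promoted(n, 0.0f) {}

  // Queue partial[i] += a[i] * b[i] without executing it yet.
  void issue_mma(std::vector<float> a, std::vector<float> b) {
    assert(a.size() == partial.size() && b.size() == partial.size());
    queued.push_back([this, a = std::move(a), b = std::move(b)] {
      for (std::size_t i = 0; i < partial.size(); ++i) partial[i] += a[i] * b[i];
    });
  }

  // Stand-in for warpgroup_wait<0>(): drain everything outstanding.
  void wait_all() {
    for (auto& work : queued) work();
    queued.clear();
  }

  // The wait is the first step of scaling, so partial is never read
  // while work is still outstanding.
  void scale_if_needed(float scale) {
    wait_all();
    for (std::size_t i = 0; i < partial.size(); ++i) {
      promoted[i] += partial[i] * scale;
      partial[i] = 0.0f;  // resetting here is this toy's choice
    }
  }
};
```

In this toy, reading `partial` right after `issue_mma` still shows the old values, which is exactly the hazard raised in the question. Calling `scale_if_needed` is safe only because it drains the queue before touching `partial`.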
