[Cutlass profiler] Fix SM100 FP8 nosmem epilogue shape_div “Divisibility Condition” for non‑multiple‑of‑64 N tiles by aidando73 · Pull Request #2946 · NVIDIA/cutlass

aidando73 · 2026-01-10T20:06:48Z

I'm getting this error trying to generate e4m3 fp8 kernels for SM100:

/home/aidan/fireworks/cutlass/include/cute/int_tuple.hpp(408): error: static assertion failed with "Divisibility Condition"
        static_assert(((IntTupleA::value % IntTupleB::value) == 0) or ((IntTupleB::value % IntTupleA::value) == 0), "Divisibility Condition");
        ^
          detected during:
            instantiation of "auto cute::shape_div(const IntTupleA &, const IntTupleB &) [with IntTupleA=cute::_136, IntTupleB=cute::_64]" at line 391
            instantiation of function "lambda [](const auto &, const auto &)->auto [with <auto-1>=cute::_136, <auto-2>=cute::_64]" at line 109 of /home/aidan/fireworks/cutlass/include/cute/algorithm/tuple_algorithms.hpp
            instantiation of "auto cute::detail::tapply(T0 &&, T1 &&, F &&, G &&, cute::seq<I...>) [with T0=const cute::tuple<cute::_64, cute::_136> &, T1=const cute::tuple<cute::C<64>, cute::C<64>> &, F=lambda [](const auto &, const auto &)->auto &, G=lambda [](const auto &...)->auto, I=<0, 1>]" at line 225 of /home/aidan/fireworks/cutlass/include/cute/algorithm/tuple_algorithms.hpp
            instantiation of "auto cute::transform(const T0 &, const T1 &, F &&) [with T0=cute::tuple<cute::_64, cute::_136>, T1=cute::tuple<cute::C<64>, cute::C<64>>, F=lambda [](const auto &, const auto &)->auto]" at line 391
            instantiation of "auto cute::shape_div(const IntTupleA &, const IntTupleB &) [with IntTupleA=cute::tuple<cute::_64, cute::_136>, IntTupleB=cute::tuple<cute::C<64>, cute::C<64>>]" at line 1499 of /home/aidan/fireworks/cutlass/include/cutlass/epilogue/collective/builders/sm100_builder.inl
            instantiation of class "cutlass::epilogue::collective::CollectiveBuilder<cutlass::arch::Sm100, OpClass, MmaTileShape_MNK, ClusterShape_MNK, EpilogueTileType, ElementAccumulator, ElementCompute, ElementC_, GmemLayoutTagC_, AlignmentC, ElementD, GmemLayoutTagD, AlignmentD, EpilogueScheduleType, FusionOpOrCallbacks, std::enable_if_t<<expression>, void>> [with OpClass=cutlass::arch::OpClassTensorOp, MmaTileShape_MNK=cute::tuple<cute::_64, cute::_136, cute::_128>, ClusterShape_MNK=cute::tuple<cute::_1, cute::_1, cute::_1>, EpilogueTileType=cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator=float, ElementCompute=float, ElementC_=void, GmemLayoutTagC_=cutlass::layout::ColumnMajor, AlignmentC=1, ElementD=cutlass::bfloat16_t, GmemLayoutTagD=cutlass::layout::ColumnMajor, AlignmentD=1, EpilogueScheduleType=cutlass::epilogue::NoSmemWarpSpecialized1Sm, FusionOpOrCallbacks=cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, void, float, cutlass::FloatRoundStyle::round_to_nearest>]" at line 1501 of /home/aidan/fireworks/cutlass/include/cutlass/epilogue/collective/builders/sm100_builder.inl
            instantiation of class "cutlass::epilogue::collective::CollectiveBuilder<cutlass::arch::Sm100, OpClass, MmaTileShape_MNK, ClusterShape_MNK, EpilogueTileType, ElementAccumulator, ElementCompute, ElementC_, GmemLayoutTagC_, AlignmentC, ElementD, GmemLayoutTagD, AlignmentD, EpilogueScheduleType, FusionOpOrCallbacks, std::enable_if_t<<expression>, void>> [with OpClass=cutlass::arch::OpClassTensorOp, MmaTileShape_MNK=cute::tuple<cute::_64, cute::_136, cute::_128>, ClusterShape_MNK=cute::tuple<cute::_1, cute::_1, cute::_1>, EpilogueTileType=cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator=float, ElementCompute=float, ElementC_=void, GmemLayoutTagC_=cutlass::layout::ColumnMajor, AlignmentC=1, ElementD=cutlass::bfloat16_t, GmemLayoutTagD=cutlass::layout::ColumnMajor, AlignmentD=1, EpilogueScheduleType=cutlass::epilogue::NoSmemWarpSpecialized1Sm, FusionOpOrCallbacks=cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, void, float, cutlass::FloatRoundStyle::round_to_nearest>]" at line 46 of /home/aidan/fireworks/tools/library/generated/gemm/100/void_gemm_e4m3/cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_bf16_64x136x128_1x1x1_0_tnn_align8_cpasync_1sm_epi_nosmem.cu

This PR fixes it - relevant line is here: https://github.com/aidando73/cutlass-1/blob/a1dfe3f4935d80726c95bae8a56c2a2c5280e73d/include/cutlass/epilogue/collective/builders/sm100_builder.inl#L1499

E.g., if CtaTileShape_MNK: (64, 136, _) and EpilogueTile: (64, 64) then this assert fails:

https://github.com/aidando73/cutlass-1/blob/a1dfe3f4935d80726c95bae8a56c2a2c5280e73d/include/cute/int_tuple.hpp#L408

And since EpilogueTile[1] is min(64, cta_n), there are two cases:

If cta_n <= 64, EpilogueTile[1] == CtaTileShape_MNK[1] -> assert passes
If cta_n > 64, EpilogueTile[1]=64, thus CtaTileShape_MNK[1] must be divisible by 64
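The two cases can be checked with a small Python sketch. It mirrors the per-mode condition of the static_assert in int_tuple.hpp and assumes, as described above, that the epilogue tile N is min(64, cta_n). The function names are illustrative and not part of CUTLASS.

```python
def divisibility_ok(a: int, b: int) -> bool:
    """Same condition as the "Divisibility Condition" static_assert for two static ints."""
    return a % b == 0 or b % a == 0


def epilogue_n_divides(cta_n: int, epi_n_cap: int = 64) -> bool:
    """Check the N mode of shape_div(CtaTileShape, EpilogueTile).

    Assumption taken from the discussion: EpilogueTile[1] == min(64, cta_n).
    """
    epi_n = min(epi_n_cap, cta_n)
    return divisibility_ok(cta_n, epi_n)


if __name__ == "__main__":
    # 136 is the tile N from the failing kernel; the others are for comparison.
    for cta_n in (32, 64, 128, 136, 192, 224):
        status = "ok" if epilogue_n_divides(cta_n) else "static_assert would fire"
        print(f"cta_n={cta_n}: {status}")
```

With cta_n=136 neither 136 % 64 nor 64 % 136 is zero, which is the failing instantiation in the log above.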

Repro command:

python $FIREWORKS_DIR/third-party/cutlass/python/cutlass_library/generator.py \
  --operations gemm \
  --architectures "100f" \
  --kernels "cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_bf16_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_f16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_f16_f16_*_tnn_*" \
  --selected-kernel-list all_kernels.txt \
  --instantiation-level "max" \
  --cuda-version "12.8.0" \
  --disable-cutlass-package-imports 2>&1 | tee all_kernels.log

cmake $FIREWORKS_DIR/third-party/cutlass \
  -DCUTLASS_NVCC_ARCHS="100f" \
  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_bf16_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_f16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_f16_f16_*_tnn_*" \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
  -DCUTLASS_LIBRARY_EXCLUDE_KERNELS="" \
  -DCUTLASS_UNITY_BUILD_ENABLED=ON 2>&1 | tee cmake.log && \
VERBOSE=1 make cutlass_profiler -j255 2>&1 | tee output.log

aidando73 · 2026-01-10T20:14:35Z

cc @hwu36 @depaulmillz

CalebDu · 2026-01-15T03:25:34Z

Hello @aidando73, good job. If you enable all MMA instruction sizes by specifying -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL, compilation can fail because the default epilogue tile (EpilogueTileAuto) may not fit irregular MMA sizes.
However, the default epilogue tile N differs by epilogue data type. It is not always 64; it may be 32.

cutlass/include/cutlass/epilogue/collective/builders/sm100_builder.inl

Lines 1008 to 1038 in 8debf77

    
    constexpr int N_perf = [&]() constexpr { // Known subtile sizes tested for perf
      // Epilogues w/o residual load are less sensitive to smem allocation
      // Target a fixed amount of compute per epilogue iteration
      if (DisableSource) {
        if (MaxBits == 4) {
          // Make epilogue tile larger to reduce the epilogue iterations.
          // 64 is the experimental value. It will minimize epilogue iterations but keep the number of A/B buffers the same.
          constexpr int ComputeElts = 8192;
          return ComputeElts / M;
        }
        constexpr int ComputeElts = 4096;
        return ComputeElts / M;
      }
      // Epilogues w/ residual load are more sensitive to smem allocation
      // Target optimal smem distribution between epilogue+mainloop based on datatype+tilesize
      else {
        if (MaxBits == 32) {
          return (CtaM > 64 && CtaN <= 128) ? 16 : 32;
        }
        // Per-column scaling is high register pressure, reduce tile to prevent spills
        else if (IsPerColScaleSupported) {
          return 32;
        }
        else if (MaxBits == 16) {
          return (CtaN <= 128) ? 32 : 64;
        }
        else {
          return 64;
        }
      }
    }();
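To see how the default N varies, the quoted lambda can be transliterated into Python. This is only a rough sketch: the parameter names are illustrative, `m` stands for the `M` used in the excerpt (its definition lies outside the quoted lines), and the builder may adjust the result further in code not shown here.

```python
def n_perf(m: int, cta_m: int, cta_n: int, max_bits: int,
           disable_source: bool, per_col_scale: bool = False) -> int:
    """Python transliteration of the quoted N_perf lambda (illustrative only)."""
    if disable_source:
        # No residual load: target a fixed amount of compute per epilogue iteration.
        compute_elts = 8192 if max_bits == 4 else 4096
        return compute_elts // m
    # Residual load present: choice depends on data type width and tile size.
    if max_bits == 32:
        return 16 if (cta_m > 64 and cta_n <= 128) else 32
    if per_col_scale:
        return 32
    if max_bits == 16:
        return 32 if cta_n <= 128 else 64
    return 64


if __name__ == "__main__":
    # Hypothetical inputs chosen to exercise the different branches.
    print(n_perf(m=64, cta_m=64, cta_n=136, max_bits=16, disable_source=True))
    print(n_perf(m=128, cta_m=128, cta_n=136, max_bits=16, disable_source=True))
    print(n_perf(m=64, cta_m=64, cta_n=96, max_bits=16, disable_source=False))
```

Under these assumptions a source-less epilogue with m=64 yields N=64, consistent with the (64, 64) epilogue tile in the error above, while a 16-bit epilogue with a source load and cta_n <= 128 yields N=32, so cta_n=96 would tile evenly even though 96 % 64 != 0.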

So it is not reasonable to skip cta_n % 64 != 0 unconditionally for other data types. Doing so may break kernel instantiation for those data types.

      if cta_n > 64 and (cta_n % 64 != 0):
        continue

A good solution is to add a new if branch that checks cta_n together with the C/D data types in the following code.

cutlass/python/cutlass_library/generator.py

Lines 7603 to 7606 in 8debf77

    
    for data_type in data_types:
      if ( data_type["a_type"] == DataType.e4m3 ) and ( data_type["b_type"] == DataType.e4m3 ) and\
         ( data_type["d_type"] == DataType.e5m2 ):
        continue

aidando73 · 2026-01-16T17:40:14Z

@CalebDu thanks for the review

A good solution is to add a new if branch to check cta_n and c/d_dtype together in following code.

Ok updated - only going to apply to C=void D=bf16/f16 for now.
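A minimal sketch of what such a scoped check could look like, assuming the data_type dict also carries a "c_type" key. The `DataType` enum below is a stand-in defined only for this example, and `should_skip` is a hypothetical helper; this is not the code that was merged.

```python
from enum import Enum, auto


class DataType(Enum):
    """Stand-in for cutlass_library's DataType, defined here for illustration."""
    e4m3 = auto()
    e5m2 = auto()
    bf16 = auto()
    f16 = auto()
    void = auto()


def should_skip(data_type: dict, cta_n: int) -> bool:
    """Hypothetical filter: skip only FP8 kernels with C=void and D=bf16/f16
    whose tile N is above 64 and not a multiple of 64."""
    is_fp8 = (data_type["a_type"] == DataType.e4m3
              and data_type["b_type"] == DataType.e4m3)
    c_is_void = data_type["c_type"] == DataType.void
    d_is_half = data_type["d_type"] in (DataType.bf16, DataType.f16)
    bad_tile_n = cta_n > 64 and cta_n % 64 != 0
    return is_fp8 and c_is_void and d_is_half and bad_tile_n


if __name__ == "__main__":
    void_bf16 = {"a_type": DataType.e4m3, "b_type": DataType.e4m3,
                 "c_type": DataType.void, "d_type": DataType.bf16}
    print(should_skip(void_bf16, 136))  # the failing 64x136x128 tile
    print(should_skip(void_bf16, 128))
```

Keeping the data type test and the tile test in one condition leaves every other data type combination untouched, which is the concern raised in the review.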

Tested with:

python $FIREWORKS_DIR/third-party/cutlass/python/cutlass_library/generator.py \
  --operations gemm \
  --architectures "100f" \
  --kernels "cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_f16_*_tnn_*" \
  --selected-kernel-list all_kernels.txt \
  --instantiation-level "max" \
  --cuda-version "12.8.0" \
  --disable-cutlass-package-imports 2>&1 | tee all_kernels.log

cmake $FIREWORKS_DIR/third-party/cutlass \
  -DCUTLASS_NVCC_ARCHS="100f" \
  -DCUTLASS_LIBRARY_KERNELS="cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_bf16_*_tnn_*,cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_void_f16_*_tnn_*" \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL="max" \
  -DCUTLASS_LIBRARY_EXCLUDE_KERNELS="" \
  -DCUTLASS_UNITY_BUILD_ENABLED=ON 2>&1 | tee cmake.log && \
VERBOSE=1 make cutlass_profiler -j255 2>&1 | tee output.log

I believe we'll need this for C=bf16/f16 as well - but I run into a different error:

/home/aidando/fireworks/third-party/cutlass/include/cutlass/gemm/collective/sm100_mma_warpspecialized.hpp(188): error: static assertion failed with "Specialization requires Stages set to value 1 or more."
    static_assert(DispatchPolicy::Stages >= 2, "Specialization requires Stages set to value 1 or more.");
    ^
          detected during instantiation of class "cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100TmaUmmaWarpSpecialized<Stages, SchedulerPipelineStageCount, AccumulatorPipelineStageCount, ClusterShape>, TileShape_, ElementA_, StrideA_, ElementB_, StrideB_, TiledMma_, GmemTiledCopyA_, SmemLayoutAtomA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyB_, SmemLayoutAtomB_, SmemCopyAtomB_, TransformB_> [with Stages=1, SchedulerPipelineStageCount=2, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple<int, int, cute::_1>, TileShape_=cute::tuple<cute::_256, cute::_224, cute::_128>, ElementA_=cutlass::float_e4m3_t, StrideA_=cute::tuple<int64_t, cute::C<1>, int64_t>, ElementB_=cutlass::float_e4m3_t, StrideB_=cute::tuple<int64_t, cute::C<1>, int64_t>, TiledMma_=cute::TiledMMA<cute::MMA_Atom<cute::MMA_Traits<cute::SM100_MMA_F8F6F4_2x1SM_SS, cutlass::float_e4m3_t, cutlass::float_e4m3_t, float, cute::C<256>, cute::C<224>, cute::integral_constant<cute::UMMA::Major, cute::UMMA::Major::K>, cute::integral_constant<cute::UMMA::Major, cute::UMMA::Major::K>, cute::integral_constant<cute::UMMA::ScaleIn, cute::UMMA::ScaleIn::One>, cute::integral_constant<cute::UMMA::ScaleIn, cute::UMMA::ScaleIn::One>>>, cute::Layout<cute::tuple<cute::_1, cute::_1, cute::_1>, cute::tuple<cute::_0, cute::_0, cute::C<0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, GmemTiledCopyA_=cute::SM100_TMA_2SM_LOAD_MULTICAST, SmemLayoutAtomA_=cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::_8, cute::_128>, cute::tuple<cute::_128, cute::_1>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyB_=cute::SM100_TMA_2SM_LOAD_MULTICAST, SmemLayoutAtomB_=cute::ComposedLayout<cute::Swizzle<3, 4, 3>, cute::smem_ptr_flag_bits<8>, cute::Layout<cute::tuple<cute::_8, cute::_128>, cute::tuple<cute::_128, cute::_1>>>, SmemCopyAtomB_=void, TransformB_=cute::identity]" at line 72 of 
/home/aidando/fireworks/third-party/cutlass/tools/library/generated/gemm/100/f16_gemm_e4m3/cutlass3x_sm100_tensorop_gemm_e4m3_e4m3_f32_f16_f16_256x224x128_0x0x1_0_tnn_align16_2sm.cu

1 error detected in the compilation of "/home/aidando/fireworks/third-party/cutlass/tools/library/cutlass_library_gemm_sm100_bf16_gemm_e4m3_objs.unity.367cb6d1c2ab.cu".

So I will keep it scoped only to C=void for now and revisit this later.

CalebDu · 2026-01-19T05:35:52Z

@Junkai-Wu @hwu36 LGTM.

aidando73 added 4 commits January 10, 2026 20:05

.

4508b22

.

23c6005

.

5565347

.

b352fa8

aidando73 mentioned this pull request Jan 10, 2026

[QST] FP8 Blockwise GEMM worse than fp16 case #2923

Open

aidando73 added 2 commits January 16, 2026 17:27

.

b4c1d42

.

c2174ae

.

e735f62

Junkai-Wu approved these changes Jan 20, 2026


Junkai-Wu merged commit 3f5bafb into NVIDIA:main Jan 20, 2026

Junkai-Wu merged 7 commits into NVIDIA:main from aidando73:aidand-fix-e4m3-cutlass-profiler
