Replies: 1 comment
cc: @efric who is looking at something similar for matvec
Existing heuristics in IREE are largely based on simplified trip count metrics and often neglect key dimensions such as K in GEMM. This proposal introduces a more principled approach using arithmetic intensity (AI) to classify workloads and guide tile size selection.
Current Heuristic in IREE
IREE’s existing heuristic for GEMM and IGEMM derives from a simple trip‐count estimate:
See iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp, lines 158 to 172 at commit 4ba2e34.
Within each bucket, tile sizes are distributed evenly across M and N, minimizing global memory loads, and the K‐tile size is chosen via GCD seeds.
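As a rough illustration of the GCD-seed idea (the seed list and function below are illustrative, not IREE's actual implementation in ConfigUtils.cpp), the K-tile pick amounts to finding the largest seed-derived divisor of K:

```python
import math

def pick_k_tile(k, seeds=(128, 64, 32, 16, 8, 4)):
    """Pick a K-tile that evenly divides K, preferring larger seeds.

    Illustrative sketch only; IREE's real heuristic lives in ConfigUtils.cpp.
    """
    return max(math.gcd(k, s) for s in seeds)

print(pick_k_tile(256))  # a power-of-two K divides the largest seed
print(pick_k_tile(24))   # odd-shaped K falls back to a smaller divisor
```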
Limitation: This approach ignores the scale of the K dimension. For example, GEMMs of shapes 256×256×16, 256×256×256, and 256×256×4096 all land in the “small” bucket, despite vastly different compute-to-memory ratios. Since IREE applies the same heuristic to convolutions via IGEMM, improvements here benefit both GEMM and convolution kernels.
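To make the limitation concrete, arithmetic intensity (FLOPs per byte) can be computed for these three shapes. The helper below is illustrative; it assumes f32 operands and counts each multiply-accumulate as 2 FLOPs:

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte moved for an M x N x K GEMM (f32 by default)."""
    flops = 2 * m * n * k  # one multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, and C traffic
    return flops / bytes_moved

for shape in [(256, 256, 16), (256, 256, 256), (256, 256, 4096)]:
    print(shape, round(arithmetic_intensity(*shape), 2))
```

The three shapes yield AIs of roughly 7.1, 42.7, and 62.1 FLOPs/byte, an order-of-magnitude spread that a trip-count bucket cannot see.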
Candidate Metrics for Workload Characterization
M * N)M * N * K)M*K + K*N + M*N)Using AI to Guide Workgroup Tile Sizes
A workgroup tile must balance compute throughput against memory traffic. AI directly measures compute per byte transferred, so we partition AI into three ranges (low, medium, and high) and tailor tile shapes to each range.
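The three-way partition can be sketched as a simple classifier. The cutoff values below are placeholders; the proposal derives the real ones from percentile analysis and regression on tuned configurations:

```python
# Placeholder cutoffs (FLOPs/byte); the actual values come from the
# regression analysis described in "Determining Cutoff Points".
LOW_AI_CUTOFF = 8.0
HIGH_AI_CUTOFF = 40.0

def classify_workload(ai):
    """Bucket a workload by arithmetic intensity (FLOPs/byte)."""
    if ai < LOW_AI_CUTOFF:
        return "low"
    if ai < HIGH_AI_CUTOFF:
        return "medium"
    return "high"
```

Each bucket would then map to its own tile-shape strategy.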
Determining Cutoff Points
We identify two AI cutoff values via percentile analysis and regression:
By fitting separate linear models on the low and high AI buckets and examining their $R^2$ scores, we pinpoint the split points. Manual adjustment ensures robustness against small sample sizes and aligns cutoffs with the target hardware's memory-compute crossover. The final cutoff points yielded $R^2$ scores of 0.84 and 0.33, respectively.
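The bucket-fitting step can be sketched as follows: fit an ordinary least-squares line to the points below and above a candidate cutoff and compare the two $R^2$ scores. The function names are illustrative, not the actual analysis script:

```python
def linear_r2(xs, ys):
    """R^2 of an ordinary least-squares line fit y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def bucket_scores(ais, perfs, cutoff):
    """Fit separate lines below/above a candidate AI cutoff."""
    lo = [(a, p) for a, p in zip(ais, perfs) if a < cutoff]
    hi = [(a, p) for a, p in zip(ais, perfs) if a >= cutoff]
    return linear_r2(*zip(*lo)), linear_r2(*zip(*hi))
```

Sweeping `cutoff` over candidate percentiles and keeping the splits with the best per-bucket fits recovers the cutoff-selection procedure described above.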
Learned Strategies from Convolution Tuning
From a training set of ~40 diverse, already-tuned convolution configurations, three dominant strategies emerged. Combining the subgroup-favor and workgroup-cap strategies yields the highest uplift.
Limitations & Future Directions
Appendix:
- Data frame
- Regression script
- Tuning spec processing and analysis
See scripts here.