Replies: 1 comment
cc: @efric who is looking at something similar for matvec
Existing heuristics in IREE are largely based on simplified trip count metrics and often neglect key dimensions such as K in GEMM. This proposal introduces a more principled approach using arithmetic intensity (AI) to classify workloads and guide tile size selection.
Current Heuristic in IREE
IREE’s existing heuristic for GEMM and IGEMM derives from a simple trip‐count estimate:
See iree/compiler/src/iree/compiler/Codegen/Dialect/GPU/TargetUtils/ConfigUtils.cpp, lines 158 to 172 at commit 4ba2e34.
Within each bucket, tile sizes are distributed evenly across M and N, minimizing global memory loads, and the K‐tile size is chosen via GCD seeds.
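As a rough illustration of the GCD-seed idea (the seed list and function below are illustrative, not IREE's actual implementation in ConfigUtils.cpp), the K-tile pick amounts to finding the largest seed-derived divisor of K:

```python
import math

def pick_k_tile(k, seeds=(128, 64, 32, 16, 8, 4)):
    """Pick a K-tile that evenly divides K, preferring larger seeds.

    Illustrative sketch only; IREE's real heuristic lives in ConfigUtils.cpp.
    """
    return max(math.gcd(k, s) for s in seeds)

print(pick_k_tile(256))  # a power-of-two K divides the largest seed
print(pick_k_tile(24))   # odd-shaped K falls back to a smaller divisor
```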
Limitation: This approach ignores the scale of the K dimension. For example, GEMMs of shapes 256×256×16, 256×256×256, and 256×256×4096 all land in the “small” bucket, despite vastly different compute-to-memory ratios. Since IREE applies the same heuristic to convolutions via IGEMM, improvements here benefit both GEMM and convolution kernels.
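To make the limitation concrete, arithmetic intensity (FLOPs per byte) can be computed for these three shapes. The helper below is illustrative; it assumes f32 operands and counts each multiply-accumulate as 2 FLOPs:

```python
def arithmetic_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte moved for an M x N x K GEMM (f32 by default)."""
    flops = 2 * m * n * k  # one multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A, B, and C traffic
    return flops / bytes_moved

for shape in [(256, 256, 16), (256, 256, 256), (256, 256, 4096)]:
    print(shape, round(arithmetic_intensity(*shape), 2))
```

The three shapes yield AIs of roughly 7.1, 42.7, and 62.1 FLOPs/byte, an order-of-magnitude spread that a trip-count bucket cannot see.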
Candidate Metrics for Workload Characterization
M * N)M * N * K)M*K + K*N + M*N)Using AI to Guide Workgroup Tile Sizes
A workgroup tile must balance compute throughput against memory traffic. AI directly measures compute per byte transferred, so we partition AI into three ranges (low, medium, and high) and tailor tile shapes to each range.
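The three-way partition can be sketched as a simple classifier. The cutoff values below are placeholders; the proposal derives the real ones from percentile analysis and regression on tuned configurations:

```python
# Placeholder cutoffs (FLOPs/byte); the actual values come from the
# regression analysis described in "Determining Cutoff Points".
LOW_AI_CUTOFF = 8.0
HIGH_AI_CUTOFF = 40.0

def classify_workload(ai):
    """Bucket a workload by arithmetic intensity (FLOPs/byte)."""
    if ai < LOW_AI_CUTOFF:
        return "low"
    if ai < HIGH_AI_CUTOFF:
        return "medium"
    return "high"
```

Each bucket would then map to its own tile-shape strategy.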
Determining Cutoff Points
We identify two AI cutoff values via percentile analysis and regression:
By fitting separate linear models on the low and high AI buckets and examining their $R^2$ scores, we pinpoint the split points. Manual adjustment ensures robustness against small sample sizes and aligns cutoffs with the target hardware's memory-compute crossover. The final cutoff points yielded $R^2$ scores of 0.84 and 0.33, respectively.
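The bucket-fitting step can be sketched as follows: fit an ordinary least-squares line to the points below and above a candidate cutoff and compare the two $R^2$ scores. The function names are illustrative, not the actual analysis script:

```python
def linear_r2(xs, ys):
    """R^2 of an ordinary least-squares line fit y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def bucket_scores(ais, perfs, cutoff):
    """Fit separate lines below/above a candidate AI cutoff."""
    lo = [(a, p) for a, p in zip(ais, perfs) if a < cutoff]
    hi = [(a, p) for a, p in zip(ais, perfs) if a >= cutoff]
    return linear_r2(*zip(*lo)), linear_r2(*zip(*hi))
```

Sweeping `cutoff` over candidate percentiles and keeping the splits with the best per-bucket fits recovers the cutoff-selection procedure described above.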
Learned Strategies from Convolution Tuning
From a training set of ~40 diverse, already-tuned convolution configurations, three dominant strategies emerged. Combining the subgroup-favor and workgroup-cap strategies yields the highest uplift.
Limitations & Future Directions
Appendix:
- Data frame
- Regression script
- Tuning spec processing and analysis
See scripts here.