@jtuyls jtuyls commented Oct 30, 2025

While trying to enable the tensor ukernels flag by default (#22318), we found that some ukernels perform worse on certain matmul benchmarks, which results in regressions: https://github.com/iree-org/iree/actions/runs/18651161302/job/53169844154?pr=22318.

Through experimentation I found that the large f16 data-tiling ukernel only starts performing well at around M = 512, N ≈ 32832, and K ≈ 512.
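To make the proposed cutoff concrete, here is a minimal C++ sketch of a shape gate. The struct `MatmulShape`, the function `shouldUseLargeF16Ukernel`, and the constants are hypothetical and only mirror the approximate values quoted above; they are not the code in this PR.

```cpp
#include <cstdint>

// Hypothetical matmul problem size; not an IREE type.
struct MatmulShape {
  int64_t m;
  int64_t n;
  int64_t k;
};

// Illustrative lower bounds, taken from the approximate thresholds in the
// description (M = 512, N ~ 32832, K ~ 512). These are assumptions.
constexpr int64_t kMinM = 512;
constexpr int64_t kMinN = 32832;
constexpr int64_t kMinK = 512;

// Returns true only when every dimension reaches its lower bound.
constexpr bool shouldUseLargeF16Ukernel(const MatmulShape &s) {
  return s.m >= kMinM && s.n >= kMinN && s.k >= kMinK;
}
```

Under these assumed bounds, a 512x32832x512 problem would select the ukernel, while a problem with M = 256 would fall back to the non-ukernel path regardless of N and K.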

Benchmark Results (K=4096, N=16384)

| M | No UKernels (ms) | All UKernels (ms) |
|---|---|---|
| 4 | 0.739 | 0.832 |
| 8 | 0.752 | 0.782 |
| 16 | 0.700 | 0.731 |
| 32 | 0.700 | 0.737 |
| 64 | 0.718 | 0.823 |
| 128 | 0.870 | 0.765 |
| 256 | 0.701 | 0.853 |
| 512 | 0.754 | 1.02 |
| 1024 | 0.988 | 0.949 |
| 2048 | 1.07 | 1.5 |
| 4096 | 2.07 | 2.58 |
| 8192 | 2.85 | 3.73 |

Benchmark Results (K=4096, N=32768)

| M | No UKernels (ms) | All UKernels (ms) |
|---|---|---|
| 4 | 0.784 | 0.74 |
| 8 | 0.724 | 0.841 |
| 16 | 0.699 | 0.822 |
| 32 | 0.784 | 0.806 |
| 64 | 0.755 | 0.784 |
| 128 | 0.938 | 0.874 |
| 256 | 0.821 | 0.806 |
| 512 | 1.02 | 1.04 |
| 1024 | 1.33 | 1.57 |
| 2048 | 2.18 | 2.47 |
| 4096 | 3.37 | 3.64 |
| 8192 | 5.65 | 6.88 |

Benchmark Results (K=4096, N=65536)

| M | No UKernels (ms) | All UKernels (ms) |
|---|---|---|
| 4 | 0.831 | 0.931 |
| 8 | 0.749 | 0.907 |
| 16 | 0.733 | 0.916 |
| 32 | 0.809 | 0.835 |
| 64 | 0.929 | 0.891 |
| 128 | 0.869 | 0.844 |
| 256 | 1.09 | 0.835 |
| 512 | 1.38 | 1.6 |
| 1024 | 2.25 | 2.35 |
| 2048 | 3.56 | 3.55 |
| 4096 | 6.71 | 6.06 |
| 8192 | 12.7 | 7.99 |

I did a more granular sweep around these boundaries to show that they are reasonable. As the sweep tables below show, not using a ukernel performs better for smaller M. From the sweep alone, a threshold around M = 384 looks better, but the results above show that this does not always hold, so I am being a bit more conservative here.
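One way to read the benchmark tables is as a speedup ratio: the no-ukernel time divided by the ukernel time, where a value above 1.0 means the ukernel is faster. The sketch below does this for the K=4096, N=65536 table. The names `Row`, `rowsN65536`, `speedup`, and `firstConsistentWinM` are hypothetical helpers introduced for illustration; only the numbers come from the table.

```cpp
#include <vector>

// One benchmark row: matmul M size and the two measured times in milliseconds.
struct Row {
  int m;
  double noUkernelMs;
  double ukernelMs;
};

// Rows copied from the K=4096, N=65536 table, in ascending M order.
inline std::vector<Row> rowsN65536() {
  return {
      {4, 0.831, 0.931},    {8, 0.749, 0.907},    {16, 0.733, 0.916},
      {32, 0.809, 0.835},   {64, 0.929, 0.891},   {128, 0.869, 0.844},
      {256, 1.09, 0.835},   {512, 1.38, 1.6},     {1024, 2.25, 2.35},
      {2048, 3.56, 3.55},   {4096, 6.71, 6.06},   {8192, 12.7, 7.99},
  };
}

// Speedup of the ukernel path: > 1.0 means the ukernel is faster.
inline double speedup(const Row &r) { return r.noUkernelMs / r.ukernelMs; }

// Smallest M such that the ukernel is at least as fast for that row and for
// every larger M in the list; returns -1 if the ukernel loses on the last row.
inline int firstConsistentWinM(const std::vector<Row> &rows) {
  int result = -1;
  for (auto it = rows.rbegin(); it != rows.rend(); ++it) {
    if (speedup(*it) < 1.0)
      break;
    result = it->m;
  }
  return result;
}
```

On these rows the ratio for M = 8192 is about 1.59, while M = 512 and M = 1024 come out slightly below 1.0. The tables do not report run-to-run variance, so differences of a few percent should be read with care.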

Comprehensive Matrix Sweep Results - M×K×N - All UKernels

M/N K=256 K=320 K=384 K=448 K=512 K=640 K=768 K=896 K=1024 K=1152
256/32640 0.67 0.71 0.74 0.75 0.74 0.71 0.71 0.71 0.76 0.66
256/32704 0.73 0.76 0.68 0.73 0.62 0.73 0.70 0.70 0.70 0.77
256/32768 0.76 0.79 0.75 0.68 0.79 0.72 0.68 0.67 0.69 0.63
256/32832 0.69 0.68 0.76 0.77 0.66 0.76 0.74 0.68 0.73 0.74
320/32640 0.80 0.73 0.74 0.67 0.75 0.70 0.70 0.78 0.67 0.68
320/32704 0.78 0.77 0.85 0.76 0.73 0.65 0.82 0.79 0.70 0.66
320/32768 0.76 0.74 0.82 0.67 0.72 0.67 0.67 0.74 0.71 0.73
320/32832 0.79 0.77 0.68 0.70 0.66 0.82 0.88 0.68 0.66 0.70
384/32640 0.73 0.74 0.67 0.76 0.74 0.67 0.67 0.72 0.75 0.68
384/32704 0.71 0.66 0.86 0.77 0.70 0.70 0.69 0.69 0.70 0.69
384/32768 0.71 0.74 0.76 0.70 0.71 0.71 0.66 0.68 0.74 0.68
384/32832 0.66 0.66 0.70 0.69 0.75 0.69 0.68 0.69 0.69 0.74
448/32640 0.65 0.73 0.70 0.69 0.66 0.79 0.70 0.72 0.70 0.66
448/32704 0.67 0.70 0.78 0.69 0.69 0.71 0.74 0.79 0.67 0.72
448/32768 0.75 0.65 0.64 0.73 0.83 0.67 0.65 0.73 0.67 0.67
448/32832 0.75 0.70 0.67 0.70 0.66 0.66 0.69 0.81 0.69 0.65
512/32640 0.69 0.71 0.71 0.69 0.77 0.76 0.76 0.75 0.74 0.69
512/32704 0.81 0.64 0.75 0.75 0.74 0.77 0.67 0.73 0.76 0.66
512/32768 0.69 0.69 0.73 0.81 0.67 0.74 0.74 0.69 0.65 0.70
512/32832 0.70 0.67 0.77 0.76 0.65 0.70 0.72 0.71 0.76 0.65
640/32640 0.75 0.66 0.78 0.77 0.69 0.67 0.71 0.85 0.79 0.73
640/32704 0.72 0.75 0.69 0.76 0.79 0.71 0.77 0.70 0.79 0.73
640/32768 0.73 0.84 0.64 0.71 0.70 0.69 0.72 0.76 0.79 0.71
640/32832 0.71 0.81 0.72 0.67 0.80 0.77 0.75 0.72 0.88 0.83
768/32640 0.73 0.74 0.68 0.69 0.81 0.69 0.85 0.80 0.77 0.82
768/32704 0.78 0.67 0.66 0.68 0.73 0.81 0.76 0.79 0.75 0.76
768/32768 0.71 0.72 0.76 0.74 0.71 0.70 0.76 0.82 0.92 0.75
768/32832 0.67 0.70 0.73 0.75 0.64 0.82 0.72 0.79 0.69 0.77
896/32640 0.75 0.76 0.75 0.79 0.71 0.80 0.71 0.83 0.79 0.75
896/32704 0.76 0.66 0.81 0.75 0.75 0.78 0.74 0.75 0.78 0.78
896/32768 0.72 0.65 0.71 0.81 0.68 0.71 0.70 0.74 0.72 0.72
896/32832 0.81 0.65 0.77 0.78 0.69 0.77 0.76 0.78 0.74 0.75
1024/32640 0.72 0.71 0.75 0.78 0.72 0.89 0.76 0.78 0.74 0.84
1024/32704 0.69 0.68 0.71 0.69 0.70 0.81 0.77 0.76 0.74 0.74
1024/32768 0.74 0.77 0.70 0.73 0.73 0.76 0.74 0.77 0.80 0.75
1024/32832 0.70 0.70 0.73 0.78 0.75 0.77 0.75 0.72 0.72 0.82
1152/32640 0.74 0.67 0.83 0.64 0.72 0.81 0.85 0.85 0.72 0.81
1152/32704 0.69 0.78 0.78 0.64 0.77 0.80 0.78 0.78 0.82 0.81
1152/32768 0.74 0.70 0.65 0.78 0.75 0.73 0.78 0.73 0.76 0.81
1152/32832 0.69 0.85 0.78 0.90 0.76 0.81 0.80 0.81 0.80 0.88

All values in milliseconds

Matrix Sweep Results - M×K×N - No UKernels

M/N K=256 K=320 K=384 K=448 K=512 K=640 K=768 K=896 K=1024 K=1152
256/32640 0.81 0.78 0.68 0.70 0.71 0.69 0.73 0.76 0.76 0.65
256/32704 0.70 0.67 0.70 0.70 0.74 0.71 0.77 0.70 0.86 0.63
256/32768 0.66 0.69 0.63 0.71 0.68 0.77 0.68 0.84 0.72 0.65
256/32832 0.65 0.68 0.74 0.78 0.79 0.71 0.74 0.68 0.72 0.76
320/32640 0.66 0.65 0.75 0.74 0.74 0.70 0.83 0.73 0.80 0.63
320/32704 0.74 0.68 0.72 0.75 0.69 0.65 0.80 0.73 0.70 0.66
320/32768 0.76 0.80 0.70 0.67 0.73 0.77 0.67 0.70 0.73 0.76
320/32832 0.69 0.78 0.68 0.70 0.66 0.72 0.80 0.71 0.66 0.75
384/32640 0.82 0.75 0.67 0.76 0.73 0.67 0.67 0.72 0.86 0.69
384/32704 0.81 0.66 0.77 0.78 0.72 0.75 0.69 0.69 0.70 0.69
384/32768 0.73 0.77 0.85 0.74 0.76 0.77 0.66 0.69 0.72 0.73
384/32832 0.72 0.66 0.70 0.69 0.72 0.69 0.68 0.69 0.73 0.75
448/32640 0.67 0.67 0.70 0.69 0.78 0.78 0.74 0.72 0.74 0.66
448/32704 0.77 0.70 0.72 0.69 0.70 0.74 0.78 0.80 0.67 0.69
448/32768 0.65 0.83 0.64 0.73 0.72 0.69 0.74 0.73 0.71 0.75
448/32832 0.65 0.68 0.77 0.70 0.66 0.66 0.73 0.81 0.75 0.78
512/32640 0.70 0.79 0.78 0.73 0.72 0.70 0.76 0.72 0.75 0.71
512/32704 0.69 0.71 0.75 0.82 0.74 0.70 0.67 0.75 0.74 0.72
512/32768 0.77 0.74 0.76 0.79 0.67 0.78 0.75 0.69 0.65 0.70
512/32832 0.71 0.80 0.72 0.74 0.65 0.89 0.78 0.71 0.71 0.65
640/32640 0.88 0.66 0.76 0.78 0.69 0.67 0.71 0.76 0.80 0.73
640/32704 0.70 0.74 0.69 0.77 0.74 0.71 0.78 0.70 0.79 0.72
640/32768 0.81 0.67 0.72 0.74 0.78 0.70 0.77 0.76 0.80 0.77
640/32832 0.67 0.71 0.70 0.87 0.74 0.88 0.72 0.72 0.76 0.71
768/32640 0.77 0.82 0.68 0.69 0.80 0.85 0.81 0.81 0.77 0.85
768/32704 0.79 0.67 0.66 0.68 0.78 0.78 0.76 0.80 0.83 0.82
768/32768 0.71 0.73 0.79 0.76 0.73 0.76 0.87 0.78 0.78 0.75
768/32832 0.70 0.70 0.68 0.70 0.64 0.78 0.72 0.74 0.81 0.72
896/32640 0.76 0.75 0.71 0.75 0.71 0.83 0.71 0.78 0.78 0.75
896/32704 0.72 0.66 0.74 0.75 0.79 0.75 0.74 0.82 0.77 0.75
896/32768 0.71 0.77 0.78 0.74 0.70 0.71 0.70 0.73 0.76 0.72
896/32832 0.74 0.65 0.76 0.75 0.69 0.74 0.76 0.75 0.74 0.75
1024/32640 0.72 0.71 0.77 0.78 0.72 0.84 0.76 0.92 0.74 0.88
1024/32704 0.69 0.68 0.71 0.69 0.70 0.81 0.77 0.76 0.82 0.74
1024/32768 0.74 0.78 0.78 0.73 0.77 0.76 0.74 0.77 0.81 0.75
1024/32832 0.70 0.74 0.73 0.78 0.75 0.77 0.79 0.72 0.72 0.84
1152/32640 0.74 0.67 0.73 0.64 0.72 0.81 0.84 0.90 0.72 0.81
1152/32704 0.69 0.78 0.79 0.64 0.77 0.80 0.78 0.78 0.96 0.81
1152/32768 0.74 0.70 0.65 0.78 0.76 0.73 0.78 0.73 0.79 0.81
1152/32832 0.69 0.75 0.78 0.83 0.78 0.81 0.80 0.82 0.80 0.84

All values in milliseconds

@jtuyls jtuyls requested review from Yu-Zhewen and bjacob October 30, 2025 16:31
@jtuyls jtuyls requested a review from kuhar as a code owner October 30, 2025 16:31
  // bounds on dynamic dimensions for encodings.
  if (IntegerAttr sizeMin = constraint.getSizeMin()) {
-   if (iterationSizes[indexVal] < sizeMin.getInt()) {
+   if (!ShapedType::isDynamic(iterationSizes[indexVal]) &&
Suggested change
- if (!ShapedType::isDynamic(iterationSizes[indexVal]) &&
+ if (ShapedType::isStatic(iterationSizes[indexVal]) &&
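The guard under review presumably matters because MLIR encodes a dynamic dimension as a sentinel value, so a plain `size < sizeMin` comparison would treat every dynamic size as too small. The standalone sketch below illustrates this without linking against MLIR. `kDynamic`, `isDynamic`, and `isStatic` are local stand-ins for the `ShapedType` members, and the sentinel being the minimum `int64_t` is an assumption of this sketch; `violatesMinNaive` and `violatesMin` are hypothetical names.

```cpp
#include <cstdint>
#include <limits>

// Local stand-in for MLIR's ShapedType::kDynamic sentinel (assumed here to be
// the minimum int64_t value). This sketch does not use MLIR itself.
constexpr int64_t kDynamic = std::numeric_limits<int64_t>::min();

constexpr bool isDynamic(int64_t size) { return size == kDynamic; }
constexpr bool isStatic(int64_t size) { return size != kDynamic; }

// Naive check: a dynamic size is a large negative number, so it compares
// below any positive minimum and is reported as a violation.
constexpr bool violatesMinNaive(int64_t size, int64_t sizeMin) {
  return size < sizeMin;
}

// Guarded check: only statically known sizes can violate the minimum.
constexpr bool violatesMin(int64_t size, int64_t sizeMin) {
  return isStatic(size) && size < sizeMin;
}
```

With these stand-ins, `isStatic(x)` and `!isDynamic(x)` are equivalent, so the suggested change affects readability only, not behavior.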
