kleidiai: add and integrate SVE 256-bit vector-length kernel #18458

chaxu01 · 2025-12-29T11:31:08Z

Benchmark results for Llama 3.2 1B Q4_0 on Graviton 3, in tokens per second (t/s), comparing the non-SVE and SVE-enabled kernels.
pp512 is prompt processing of 512 tokens; tg128 is text generation of 128 tokens.

Threads	Test	w/o SVE (t/s)	w/ SVE (t/s)	Uplift (%)
1	pp512	75.09 ± 0.02	84.51 ± 0.05	+12.55%
1	tg128	18.69 ± 0.01	20.49 ± 0.00	+9.63%
2	pp512	148.77 ± 0.02	166.83 ± 0.02	+12.14%
2	tg128	34.63 ± 0.01	37.21 ± 0.02	+7.45%
4	pp512	293.64 ± 0.07	326.82 ± 0.13	+11.30%
4	tg128	63.49 ± 0.07	67.95 ± 0.02	+7.04%
8	pp512	525.17 ± 0.11	568.10 ± 0.12	+8.17%
8	tg128	97.93 ± 0.03	105.00 ± 0.06	+7.20%
16	pp512	949.33 ± 11.10	1016.97 ± 1.04	+7.13%
16	tg128	131.35 ± 0.39	136.51 ± 0.37	+3.93%
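
For reference, the Uplift column appears to be the relative change in throughput between the two builds. A minimal sketch of that arithmetic, assuming uplift = (with SVE - without SVE) / without SVE; the function name `uplift_percent` is hypothetical and not part of the PR:

```python
def uplift_percent(baseline_tps: float, candidate_tps: float) -> float:
    """Relative throughput change of candidate over baseline, in percent.

    Assumed formula: (candidate - baseline) / baseline * 100.
    """
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


if __name__ == "__main__":
    # (threads, test, t/s without SVE, t/s with SVE), mean values copied from the table
    rows = [
        (1, "tg128", 18.69, 20.49),
        (2, "pp512", 148.77, 166.83),
        (16, "tg128", 131.35, 136.51),
    ]
    for threads, test, without_sve, with_sve in rows:
        print(f"{threads:>2} {test}: {uplift_percent(without_sve, with_sve):+.2f}%")
```

Recomputing from the rounded means shown in the table can differ from the reported uplift by a few hundredths of a percent on some rows, presumably because the reported figures were derived from unrounded measurements.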

kleidiai: add and integrate SVE 256-bit vector-length kernel

156957e

chaxu01 requested a review from ggerganov as a code owner December 29, 2025 11:31

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Dec 29, 2025
