llamafile_sgemm API - INT8 implementation #10912
Conversation
Hi @ggerganov,

We will need to merge #10714 first, since there may be some conflicts.

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit the PR.
I'll try to submit it today. 🤞 |
I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.
Djip007
left a comment
These are quick personal comments; wait for @slaren and @ggerganov before making changes. And I only read it very quickly.
@amritahs-ibm |
All comments are addressed except for the last MMA one. The updated patch has been committed.
@Djip007 Also, for PowerPC's MMA with the int8 data type, the MMA engine requires the data to be packed in a different way. So I came up with a specific function for int8 (i.e. packNormal) to do the packing. Please find below the MMA guide:
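To illustrate why a dedicated packing routine is needed: the POWER10 MMA engine consumes int8 data as 4x4 byte tiles, one tile per 16-byte vector register, and accumulates into a 512-bit `__vector_quad` tile. The sketch below is illustrative only, not the PR's actual code; `gemm_4x4_i8` and the exact layout comment are my assumptions, while the `__builtin_mma_*` builtins are the real GCC ones (it requires gcc with `-mcpu=power10` and cannot run on other hardware).

```c
#include <stdint.h>

#if defined(__MMA__) /* POWER10 Matrix-Multiply Assist */
typedef vector unsigned char vec_t;

/* Compute a 4x4 int32 tile C = A(4xK) * B(Kx4), K a multiple of 4.
 * A and B must already be repacked so that each 16-byte vector holds
 * a 4x4 block of int8 values in the row/column order the GER
 * instructions expect -- this repacking is what a function like
 * packNormal has to do before the MMA loop.
 * C must be 16-byte aligned (disassemble_acc stores 4 vectors). */
static void gemm_4x4_i8(const vec_t *A, const vec_t *B,
                        int32_t *C, int K4 /* = K/4 */) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                   /* zero the tile   */
    for (int k = 0; k < K4; ++k)
        __builtin_mma_xvi8ger4pp(&acc, A[k], B[k]);  /* rank-4 i8 GER   */
    __builtin_mma_disassemble_acc(C, &acc);          /* spill the tile  */
}
#endif
```

Because each `xvi8ger4pp` consumes four consecutive int8 values per row of A and per column of B, a plain row-major or column-major buffer cannot be fed to it directly, which is the reason for the int8-specific packing path.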
@Djip007 Could you please help me with my latest comment?
This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes. The patch is tested with the Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
I have addressed all the comments except for the MMA repacking one. As it is an optimization over the current code, I can take it up in a follow-on patch.

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the quantised int8 datatype.
This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.
The patch is tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
gcc-toolset-13 is the minimum requirement to build
this patch. Run the cmake config and build after the
PATH has been updated with gcc-13:
export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
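For reference, a full configure-and-build sequence might look like the following. This is a sketch assuming a standard llama.cpp checkout and the default CMake flow; the gcc-toolset path is the one given above, and everything else is the project's usual cmake invocation.

```shell
# Put gcc 13 from gcc-toolset-13 at the front of PATH
export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
gcc --version   # should report gcc 13.x

# Standard llama.cpp CMake configure and build
cmake -B build
cmake --build build --config Release -j"$(nproc)"
```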