
llamafile_sgemm API - INT8 implementation #10912

Open — wants to merge 1 commit into base: master
Conversation

@amritahs-ibm (Contributor) commented Dec 20, 2024

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 data type.

This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

@github-actions bot added labels Dec 20, 2024: testing (everything test related), ggml (changes relating to the ggml tensor library for machine learning)
Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm (Contributor, Author) commented:

Hi @ggerganov,
Can you please help review this PR, or suggest any actions required from me to get it reviewed?

@slaren (Collaborator) commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm (Contributor, Author) commented:

Sure. Please let me know once that PR is merged; I will resolve any conflicts and resubmit my PR.

@Djip007 (Contributor) commented Dec 24, 2024

I'll try to submit it today. 🤞

k, (const block_q8_0 *)A, lda,
(const block_q8_0 *)B, ldb,
(float *)C, ldc,
ith, nth};
A reviewer commented on this diff:

with #10714

You will have to take ith/nth from params->ith/params->nth. It may be the only change you need to make.
