llamafile_sgemm API - INT8 implementation #10912
Conversation
Hi @ggerganov,

We will need to merge #10714 first, since there may be some conflicts.

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit the PR.
I'll try to submit it today. 🤞 |
I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.
Djip007
left a comment
These are quick personal comments; wait for @slaren and @ggerganov before making changes. And I only read it very quickly.
@amritahs-ibm |
All comments are addressed except for the last MMA one. The updated patch has been committed.
@Djip007 Also, for PowerPC's MMA with the int8 data type, the MMA engine requires the data to be packed in a different way. So I came up with a specific function for int8 (i.e. packNormal) to do the packing. Please find below the MMA guide:
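To illustrate why a dedicated packing routine is needed: the POWER10 MMA engine consumes int8 data as 4x4 byte tiles, one tile per 16-byte vector register, and accumulates into a 512-bit `__vector_quad` tile. The sketch below is illustrative only, not the PR's actual code; `gemm_4x4_i8` and the exact layout comment are my assumptions, while the `__builtin_mma_*` builtins are the real GCC ones (it requires gcc with `-mcpu=power10` and cannot run on other hardware).

```c
#include <stdint.h>

#if defined(__MMA__) /* POWER10 Matrix-Multiply Assist */
typedef vector unsigned char vec_t;

/* Compute a 4x4 int32 tile C = A(4xK) * B(Kx4), K a multiple of 4.
 * A and B must already be repacked so that each 16-byte vector holds
 * a 4x4 block of int8 values in the row/column order the GER
 * instructions expect -- this repacking is what a function like
 * packNormal has to do before the MMA loop.
 * C must be 16-byte aligned (disassemble_acc stores 4 vectors). */
static void gemm_4x4_i8(const vec_t *A, const vec_t *B,
                        int32_t *C, int K4 /* = K/4 */) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);                   /* zero the tile   */
    for (int k = 0; k < K4; ++k)
        __builtin_mma_xvi8ger4pp(&acc, A[k], B[k]);  /* rank-4 i8 GER   */
    __builtin_mma_disassemble_acc(C, &acc);          /* spill the tile  */
}
#endif
```

Because each `xvi8ger4pp` consumes four consecutive int8 values per row of A and per column of B, a plain row-major or column-major buffer cannot be fed to it directly, which is the reason for the int8-specific packing path.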
@Djip007 Could you please help me with my latest comment?
This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes. The patch is tested with the Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf models on an IBM POWER10 machine. Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
I have addressed all the comments except for the MMA repacking one. As it is an optimization over the current code, I can take it up in a follow-on patch.

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le using MMA
builtins for the quantised int8 datatype.
This change results in a 10%-70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.
The patch is tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
gcc-toolset-13 is the minimum requirement to build
this patch. Run the cmake config and build after the
PATH has been updated with gcc-13:
export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
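For reference, a full configure-and-build sequence might look like the following. This is a sketch assuming a standard llama.cpp checkout and the default CMake flow; the gcc-toolset path is the one given above, and everything else is the project's usual cmake invocation.

```shell
# Put gcc 13 from gcc-toolset-13 at the front of PATH
export PATH=/opt/rh/gcc-toolset-13/root/usr/bin/:$PATH
gcc --version   # should report gcc 13.x

# Standard llama.cpp CMake configure and build
cmake -B build
cmake --build build --config Release -j"$(nproc)"
```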