Quantization Brainstorming #277
llama2.scala supports ggml-like q4_0 and q8 quantization, doing the quantization on the fly before inference (and also the ability to load ggml models using these quantization types). q4 and q8 have similar speed (when optimized using AVX2 kernels similar to the ones in ggml), which is significantly faster than fp32 (probably caused by more vector lanes and less memory access). The biggest benefit that I see for q4 is obviously that you can load and run bigger models in the same amount of memory. One issue that I noticed is that auto-vectorization stops working well for int8 (I suspect because the |
Hello @jrudolph, can you please help me understand ggml matmul execution w.r.t. quantization? Is it
input_float32 * Quantized_weights -> Output_float32
or
Quantize(input_float32) * Quantized_Weights -> Output_float32 -> Quantize(Output_float32) for the next layer?
I am trying to come up with a very simple int4 and int8 quantization in C++ for llama2.c. My goal is to initially match the speed of ggml/llama.cpp quantization and execution. Your help would be appreciated. |
For q4_0 and q8, the one-dimensional vector first needs to be quantized to q8.
Then, do one block of vector products in int32 and fold the block scales into an f32 running sum.
Then store the result for each row and continue with the next row.
See
https://github.com/jrudolph/llama2.scala/blob/08c65d04c0a3a4345510db289779e3243bcf7ff9/shared/src/main/scala/net/virtualvoid/llama2/ScalaMathImplementation.scala#L70
as an example, assuming the one-dimensional vector has already been quantized.
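For illustration, here is a minimal scalar C sketch of that procedure. The struct layouts and the 4-bit packing are simplified stand-ins, not the actual ggml or llama2.scala data structures (which use fp16 scales and a different packing):

```c
#include <stdint.h>

#define BLOCK 32

// Simplified block layouts (illustrative only):
typedef struct { float scale; int8_t  q[BLOCK];     } BlockQ8; // 32 int8 values + 1 scale
typedef struct { float scale; uint8_t q[BLOCK / 2]; } BlockQ4; // 32 packed 4-bit values + 1 scale

// Dot product of one weight row (q4 blocks) with the already-quantized input
// vector (q8 blocks): integer products accumulated in int32 per block, then the
// two block scales folded into a float running sum.
float vec_dot_q4_q8(int nblocks, const BlockQ4* row, const BlockQ8* v) {
    float sum = 0.0f;
    for (int b = 0; b < nblocks; b++) {
        int32_t acc = 0;
        for (int i = 0; i < BLOCK / 2; i++) {
            // unpack two 4-bit weights and re-center them around zero
            int w0 = (row[b].q[i] & 0x0F) - 8;
            int w1 = (row[b].q[i] >> 4) - 8;
            acc += w0 * v[b].q[2 * i] + w1 * v[b].q[2 * i + 1];
        }
        sum += (float)acc * row[b].scale * v[b].scale;
    }
    return sum;
}
```

One such call per output row produces the result vector; an AVX2 version would replace the inner loop with wide integer multiply-adds.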
|
In general, I guess you can use whatever works. Doing the bulk of the multiplications in int8 means that you can process more elements per instruction with vector instructions than with f32. Also, integer calculations may be faster than float, depending on the architecture.
|
Thanks for your reply.
When you say one-dimensional vector, are you talking about the input 1d vector?
Sorry if this is a basic question.
Thanks
|
If I understood it correctly, that means for the matmul we have to quantize the input 1d array. I am wondering if the latency to quantize this vector can surpass the gains of doing the products in int32.
|
> I am wondering if the latency to quantize this vector can surpass the gains of doing the products in int32.

I guess it might for certain shapes. For llama, e.g., the single most expensive matmul is the output computation into the logits. There you have a weights matrix of dim × vocab. So, let's say dim is 4096 and vocab is 32000. You have to do the input quantization for the 4096 elements of the input vector once, but then use it to multiply 32000 rows into a 32000-element result vector (32000 * 4096 multiplications). So, if there's a speed benefit of doing quantized calculations, it will amortize quickly.

For SIMD it is a requirement that the vector types line up, so you will have to do it. For regular types, I guess it might depend on the speed of the integer vs. float processing units in your CPU whether it makes sense.
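A quick back-of-envelope with the numbers above (dim = 4096, vocab = 32000) shows how small the one-time quantization cost is relative to the matmul itself:

$$
\frac{\text{quantization work}}{\text{matmul work}} \approx \frac{4096}{4096 \times 32000} = \frac{1}{32000} \approx 0.003\%
$$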
|
@jrudolph for activation quantization, do you use data statistics? if so what data is used? |
Not sure what you mean exactly. I just reused the way that llama.cpp does things. For each weights quantization type they also define which quantization format to use for the activations, and then provide a vec_dot implementation that can multiply those two types (e.g. see https://github.com/ggerganov/llama.cpp/blob/ee77efea2a1e3f7d153976b0934522b6bbaa62e6/ggml.c#L1657-L1663). For both q4_0 and q8, they use q8 for the activations. q4_0 and q8 work similarly in that each row is split into blocks of 32 elements, then the range is determined (by finding the maximum absolute value) and values are linearly rescaled, centered around zero, from -max..max to -128..127 (for q8) or 0..15 (for q4_0). Then you keep those quantized values and a single (fp16 or fp32) scaling factor per block. Is that what you mean with data statistics? |
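For concreteness, a minimal C sketch of that per-block scheme for the q8 case (abs-max per 32-element block, one float scale kept per block; the rounding details are illustrative and the real ggml code stores fp16 scales):

```c
#include <math.h>
#include <stdint.h>

#define BLOCK 32

// One quantized block: 32 int8 values plus one float scaling factor.
typedef struct {
    float scale;
    int8_t q[BLOCK];
} BlockQ8;

// Quantize 32 floats: find the maximum absolute value in the block and
// linearly rescale, centered around zero, so [-max, max] maps onto [-127, 127].
void quantize_block_q8(const float* x, BlockQ8* out) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        out->q[i] = (int8_t)roundf(x[i] * inv);
    }
}
```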
So:
There are many other ways of doing quantization too. E.g. you can try to "calibrate" models by passing many batches through them and recording the activation ranges at all the layers at that time. These ranges are then used in the forward pass later, skipping the process of determining those ranges. |
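As a rough sketch of what that calibration idea could look like in a run.c-style code base (the names and the abs-max criterion are assumptions, not any particular implementation):

```c
#include <math.h>

// Hypothetical per-layer calibration record: running absolute maximum of the
// activations seen while feeding calibration batches through the model.
typedef struct {
    float absmax;   // largest |activation| observed so far
    float scale;    // frozen int8 scale, computed once after calibration
} ActRange;

// Called on every activation vector during the calibration passes.
void calibrate_observe(ActRange* r, const float* x, int n) {
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > r->absmax) r->absmax = a;
    }
}

// Called once after calibration; later forward passes reuse r->scale directly
// instead of re-scanning the activations to find their range.
void calibrate_freeze(ActRange* r) {
    r->scale = r->absmax / 127.0f;
}
```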
Good point, I forgot about these. In the ggml q4_0 files for llama2, the norm weights (i.e. the ones corresponding to the normalization layers) are all stored in fp32. If you look at the table in https://huggingface.co/TheBloke/Llama-2-13B-GGML#provided-files, one can see that there are various ways to use different quantization types for different weights. ggerganov/llama.cpp#1684 is the PR that introduced the latest set of quantization setups for llama.cpp and contains lots of information about the choices made.
For people interested: "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers" has more info about such an approach. |
I'm trying to understand how it is done from the links you provided above, at the base level. I understand that there are many complex mix-and-match strategies, like keeping certain layers at high precision, etc. However, at the very basic level in llama.cpp, the quantization that seems to be done is
This seems to do symmetric quantization in 4-bits (?). I don't see any of the GPTQ-type Hessian computation with data, and matrix inversion as described in the GPTQ paper above. |
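The snippet referred to above isn't reproduced here; as a stand-in, this is a minimal illustration of what symmetric 4-bit abs-max quantization of one 32-element block looks like (not the llama.cpp source, and indeed involving no calibration data or Hessians):

```c
#include <math.h>
#include <stdint.h>

// Illustrative symmetric 4-bit quantization of one 32-element block
// (abs-max scaling only) -- not the actual llama.cpp code.
void quantize_block_q4_symmetric(const float* x, uint8_t packed[16], float* scale) {
    float amax = 0.0f;
    for (int i = 0; i < 32; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    *scale = amax / 7.0f;                               // map [-amax, amax] onto [-7, 7]
    float inv = *scale != 0.0f ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < 16; i++) {
        int q0 = (int)roundf(x[2 * i]     * inv) + 8;   // shift into the unsigned nibble range
        int q1 = (int)roundf(x[2 * i + 1] * inv) + 8;
        packed[i] = (uint8_t)((q1 << 4) | (q0 & 0x0F)); // two values per byte
    }
}
```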
Sorry, I didn't want to imply that GPTQ style quantization is done in
llama.cpp. I'm not sure it is.
|
got it, thanks.. |
@byte-6174 thanks for linking to my branch! I wanted to add some details/results so far, as my branch is a draft and not documented yet. Code structure:
Quantization methods:
Results summary
Sample output
[truncated] |
slower than float32 model? |
Just a reminder that FlashAttention also reduces the amount of memory required. It does not need the intermediate attention matrix for each head, instead computing by tiles and applying a trick to the softmax (computed in chunks, then normalized with correction factors). This could be the right project for a simple implementation. |
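For reference, a minimal (scalar, single-query) C sketch of the chunked-softmax trick mentioned above; the function name and the assumption that the scores are already computed are mine, and a real FlashAttention kernel fuses this with the tiled QK^T and PV products:

```c
#include <math.h>

// "Online" softmax-weighted sum over chunks of attention scores: keep a running
// maximum m and a running denominator l, and rescale the partial output whenever
// a new chunk raises the maximum. No full attention row is ever stored at once.
void softmax_weighted_sum_chunked(const float* scores, const float* values,
                                  int n, int chunk, int dim, float* out) {
    float m = -INFINITY;   // running max of scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    for (int d = 0; d < dim; d++) out[d] = 0.0f;

    for (int start = 0; start < n; start += chunk) {
        int end = start + chunk < n ? start + chunk : n;

        // new running max including this chunk
        float m_new = m;
        for (int i = start; i < end; i++)
            if (scores[i] > m_new) m_new = scores[i];

        // rescale previous partial sums to the new max
        float correction = expf(m - m_new);
        l *= correction;
        for (int d = 0; d < dim; d++) out[d] *= correction;

        // accumulate this chunk
        for (int i = start; i < end; i++) {
            float w = expf(scores[i] - m_new);
            l += w;
            for (int d = 0; d < dim; d++)
                out[d] += w * values[i * dim + d];
        }
        m = m_new;
    }
    // final normalization
    for (int d = 0; d < dim; d++) out[d] /= l;
}
```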
I wonder if Flash Attention even applies here? AFAIU the big NxN matrix only turns up when evaluating the prompt in batches (in training, or when having big input prompts during inference). In this project, tokens are currently processed one by one, so only one row of the attention matrix is ever materialized at once.
From playing with the smaller models (with limited context sizes) that are useful for CPU inference, it seems most of the time is spent evaluating the big matrix calculations for the FFN and logit calculations, while attention only requires significant calculation when the context starts to get filled.
Correct me if I'm wrong, but it seems many of the optimizations target big models for serious productization. The sequential nature of deep neural nets plays against the strength of GPUs to massively parallelize, so to saturate GPUs one has to come up with strategies for making each calculation wider (batch processing of prompts, or evaluating multiple prompts at the same time) without exploding memory requirements (which is where Flash Attention seems to come into play).
|
@jrudolph Yes, correct. Also, FlashAttention works by reducing IO between GPU HBM and SRAM, which doesn't apply here since it's CPU-only inference. The author of the paper pointed it out here: Dao-AILab/flash-attention#59 |
@atamurad
I think it should be
Maybe I am referring to the wrong repo? |
@mgrabban good catch, thank you! I was wondering why Q8_A wasn't working for weights other than WQ, WK, WV, WO and switched those weights to Q8_B. Why it worked for WQ, WK, WV, WO is probably that these all share almost the same or very close max values across all grouped rows. |
I've another data point to add: I had some success running a 4-bit quantized Llama2-7B-chat model with run.c. Speedup is 10x compared to FP32 weights. 4-bit model file size: 4.3GB. Quantization is based on AWQ. Activations are in FP32, so only the matmul has changed in run.c. I used AVX2 for dequantization + matrix multiplication. For 32-bit weights (only the final logit classifier), I also use #269. Repo: https://github.com/atamurad/llama2.c/tree/int4-avx2 Issues: long prompts/generation are affected by this bug in the HF export script: #286 (comment) |
@atamurad Hello, I failed to export the model when I use the
There are several experiments being done with this repo to understand and evaluate the effects of quantization on the llama2.c models. It is a great test-bed to analyze the effects of varying approaches, as the model sizes here are easier to handle.
Here is what I have in this fork:
- A simple showcase of symmetric quantization, using int8_t to store the multipliers and one float each to store the maximum values of each layer type. Thus we have 13 floats, and all other weights are stored as uint8_t.
- The quantization is done with quantize.c, and the model can be run with runq.c with a command like:
  $ ./runq stories42M_Q8.bin -t 0.1 -n 256 -i "One day, Lily met a Shoggoth" -s 2
  It outputs:
- The model sizes are reduced by 4x on disk.
- During inference, the weights are dequantized to floats, so there is no runtime speedup (yet). A minimal sketch of this scheme is included below this list.
- There is also a plotting script (you need gnuplot [brew install gnuplot on Mac] to use it). This might be useful in deciding which layers etc. are less susceptible to compression.

Would love to hear feedback and other approaches. Note that the goal of this repo is "... to be the simplest, smallest, most hackable repo..".
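For readers who want the gist of the scheme described above, here is a minimal sketch (one float scale per weight tensor type, 8-bit storage, dequantize-then-matmul at inference). The names are illustrative and the actual quantize.c / runq.c may differ in details:

```c
#include <math.h>
#include <stdint.h>

// Illustrative per-tensor-type symmetric quantization: one float scale for the
// whole tensor, weights stored as int8. (The actual quantize.c / runq.c may differ.)
typedef struct {
    float scale;   // absmax / 127 for this weight tensor type
    int8_t* q;     // quantized weights (caller provides n slots)
    int n;         // number of elements
} QTensor;

void quantize_tensor(const float* w, int n, QTensor* t) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    t->scale = amax / 127.0f;
    float inv = t->scale != 0.0f ? 1.0f / t->scale : 0.0f;
    for (int i = 0; i < n; i++) t->q[i] = (int8_t)roundf(w[i] * inv);
    t->n = n;
}

// Dequantize-then-matmul, matching the "no runtime speedup yet" approach above:
// weights are expanded back to float on the fly, so only the file size shrinks.
void matmul_dequant(float* out, const float* x, const QTensor* t, int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (int c = 0; c < cols; c++)
            acc += ((float)t->q[r * cols + c] * t->scale) * x[c];
        out[r] = acc;
    }
}
```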
Here are some notable forks as well:
https://github.com/atamurad/llama2.c/tree/quant
https://github.com/kroggen/llama2.c/tree/quantization-q8
[Please add yours..]