Quantization Brainstorming #277

Open
byte-6174 opened this issue Aug 12, 2023 · 23 comments

Comments

@byte-6174
Contributor

byte-6174 commented Aug 12, 2023

There are several experiments being done with this repo to understand and evaluate the effects of quantization on the llama2.c models.

It is a great test-bed to analyze the effects of varying approaches, as the model sizes here are easier to handle.

Here is what I have in this fork:

  • A simple showcase of symmetric quantization: the weights are stored as 8-bit integers, with one float per layer type holding that group's maximum absolute value. Thus there are 13 floats in total, and all other weights are stored in 8 bits (see the sketch after this list).

  • The quantization is done with quantize.c, and the model can be run with runq.c with a command like:
    $ ./runq stories42M_Q8.bin -t 0.1 -n 256 -i "One day, Lily met a Shoggoth" -s 2
    It outputs:

One day, Lily met a Shoggoth. She was so excited to meet him. She asked him, "What are you doing here?"
The Shogoth replied, "I'm here to help you learn about the world."
Lily was so happy to have a new friend. She asked, "What can you do?"
The Shogoth replied, "I can help you learn about the world. I can show you all the things that are different."
Lily was so excited to learn about the world. She thanked the Shogoth for being so helpful.
The Shogoth smiled and said, "You're welcome. I'm glad I could help you."
Lily was so happy to have a new friend. She knew that the Shogon was a very special friend.
achieved tok/s: 248.920863

The model sizes are reduced by 4x on disk.
During inference, the weights are dequantized back to floats, so there is no runtime speedup (yet).

  • Additionally, I have a script that plots statistics of the weights (you need gnuplot installed; brew install gnuplot on macOS). This might be useful in deciding which layers are more amenable to aggressive compression.
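
To make the first bullet concrete, here is a minimal sketch of the symmetric 8-bit scheme described there. The function names and the per-group layout are illustrative only, not taken from quantize.c or runq.c:

#include <math.h>
#include <stdint.h>

// Quantize one weight group: keep one fp32 scale (derived from the absolute max)
// and int8 values.
float quantize_group_q8(const float* w, int8_t* q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;                 // maps -amax..amax onto -127..127
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t)roundf(scale != 0.0f ? w[i] / scale : 0.0f);
    }
    return scale;                                // the single float kept per group
}

// Dequantize back to fp32 at inference time (matching the note above that
// there is no runtime speedup yet).
void dequantize_group_q8(const int8_t* q, float* w, int n, float scale) {
    for (int i = 0; i < n; i++) {
        w[i] = q[i] * scale;
    }
}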

Would love to hear feedback and other approaches. Note that the goal of this repo is "... to be the simplest, smallest, most hackable repo..", so approaches should stay simple.

Here are some notable forks as well:
https://github.com/atamurad/llama2.c/tree/quant
https://github.com/kroggen/llama2.c/tree/quantization-q8
[Please add yours..]

@jrudolph
Contributor

llama2.scala supports ggml-like q4_0 and q8 quantization, doing the quantization on the fly before inference (it can also load ggml models that use these quantization types). q4 and q8 have similar speed (when optimized using AVX2 kernels similar to the ones in ggml), which is significantly faster than fp32 (probably due to more vector lanes and less memory access). The biggest benefit that I see for q4 is obviously that you can load and run bigger models in the same amount of memory. One issue I noticed is that auto-vectorization stops working well for int8 (I suspect because the vpmaddubsw instruction, which does the bulk of the int8 matrix multiplication, involves saturation that might not be easily expressed in non-vectorized C code).
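
For reference, here is a minimal AVX2 sketch (not taken from llama2.scala or ggml; compile with -mavx2) of an int8 block dot product built around _mm256_maddubs_epi16 (vpmaddubsw). The caller would multiply the returned integer sum by the two per-block scale factors.

#include <immintrin.h>
#include <stdint.h>

// Dot product of two int8 vectors, n divisible by 32.
// vpmaddubsw needs one unsigned operand, so take |x| and move x's sign onto y.
static int32_t dot_q8_avx2(const int8_t* x, const int8_t* y, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 32) {
        __m256i vx = _mm256_loadu_si256((const __m256i*)(x + i));
        __m256i vy = _mm256_loadu_si256((const __m256i*)(y + i));
        __m256i ax  = _mm256_sign_epi8(vx, vx);                 // |x| as unsigned bytes
        __m256i sy  = _mm256_sign_epi8(vy, vx);                 // y with x's sign applied
        __m256i p16 = _mm256_maddubs_epi16(ax, sy);             // u8*s8 pairs -> saturating s16
        __m256i p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1)); // widen pairs to s32
        acc = _mm256_add_epi32(acc, p32);
    }
    // horizontal sum of the 8 int32 lanes
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}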

@Nick-infinity
Contributor

Nick-infinity commented Aug 13, 2023

Hello @jrudolph, can you please help me understand ggml's matmul execution with respect to quantization?
Is it
input_float32 * Quantized_Weights -> Output_float32
or
Quantize(input_float32) * Quantized_Weights -> Output_float32 -> Quantize(Output_float32) for the next layer?

I am trying to come up with a very simple int4 and int8 quantization in C++ for llama2.c. My goal is initially to match the speed of ggml/llama.cpp's quantized execution. Your help would be appreciated.

@jrudolph
Contributor

jrudolph commented Aug 13, 2023 via email

@jrudolph
Contributor

jrudolph commented Aug 13, 2023 via email

@Nick-infinity
Contributor

Nick-infinity commented Aug 13, 2023 via email

@Nick-infinity
Contributor

Nick-infinity commented Aug 13, 2023 via email

@jrudolph
Contributor

jrudolph commented Aug 13, 2023 via email

@byte-6174
Contributor Author

byte-6174 commented Aug 13, 2023

@jrudolph, for activation quantization, do you use data statistics? If so, what data is used?
If no data is used, how are the activations calculated in your Scala implementation?
Sorry, I have zero Scala experience, so reading the code is a little tough :)

@jrudolph
Contributor

@jrudolph, for activation quantization, do you use data statistics? If so, what data is used?
If no data is used, how are the activations calculated in your Scala implementation?

Not sure what you mean exactly. I just reused the way llama.cpp does things. For each weight quantization type, it also defines which quantization format to use for the activations, and then provides a vec_dot implementation that can multiply those two types (e.g. see https://github.com/ggerganov/llama.cpp/blob/ee77efea2a1e3f7d153976b0934522b6bbaa62e6/ggml.c#L1657-L1663).

For both q4_0 and q8, q8 is used for the activations. q4_0 and q8 work similarly in that each row is split into blocks of 32 elements, then the range is determined (by finding the maximum absolute value) and the values are linearly rescaled, centered around zero, from -max..max to -128..127 (for q8) or 0..15 (for q4_0). Then you keep those quantized values and a single (fp16 or fp32) scaling factor per block.

Is that what you mean with data statistics?
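
For concreteness, here is a minimal C sketch of that block scheme as applied to q8 activations (block size 32, one fp32 scale per block, values mapped to -127..127 here); the struct and function names are illustrative, not the actual ggml ones:

#include <math.h>
#include <stdint.h>

#define QBLK 32                  // block size, as described above

typedef struct {
    float  d;                    // per-block scale
    int8_t qs[QBLK];             // quantized values
} block_q8;

// Quantize one row of k floats (k divisible by QBLK) into k/QBLK blocks.
void quantize_row_q8(const float* x, block_q8* y, int k) {
    for (int i = 0; i < k / QBLK; i++) {
        float amax = 0.0f;                           // max |x| in this block
        for (int j = 0; j < QBLK; j++) {
            float a = fabsf(x[i * QBLK + j]);
            if (a > amax) amax = a;
        }
        float d  = amax / 127.0f;
        float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = d;
        for (int j = 0; j < QBLK; j++) {
            y[i].qs[j] = (int8_t)roundf(x[i * QBLK + j] * id);
        }
    }
}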

@karpathy
Owner

So:

  • the weights are quantized once during model export
  • the data (activations) are quantized dynamically, on demand, during the forward pass
  • however, I'd expect that not all layers are quantized, only the matmul layers (?), e.g. the rmsnorms are processed in higher precision (?). I haven't verified this; it's just what's commonly done in practice.

There are many other ways of doing quantization too. E.g. you can try to "calibrate" models by passing many batches through them and recording the activation ranges at all the layers at that time. These ranges are then used in the forward pass later, skipping the process of determining those ranges.
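
As a rough sketch of that calibration idea (hypothetical bookkeeping, not part of run.c; the layer count and the set of quantization points are made up for illustration): track a running absolute max per quantized matmul input while streaming calibration batches through the float model, then freeze those ranges for the quantized forward pass.

#include <math.h>

#define N_LAYERS       32        // e.g. Llama-2-7B has 32 transformer layers
#define N_QUANT_POINTS 7         // e.g. inputs to wq, wk, wv, wo, w1, w2, w3

// Running absolute max per (layer, quantization point), filled during calibration.
static float act_absmax[N_LAYERS][N_QUANT_POINTS];

void record_activation(int layer, int point, const float* x, int n) {
    float amax = act_absmax[layer][point];
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    act_absmax[layer][point] = amax;
}

// At inference time the activation scale is then fixed up front,
//   scale = act_absmax[layer][point] / 127.0f;
// so the per-step absmax scan is skipped in the forward pass.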

@jrudolph
Contributor

however, I'd expect that not all layers are quantized, only the matmul layers (?), e.g. the rmsnorms are processed in higher precision (?). I haven't verified this; it's just what's commonly done in practice.

Good point, I forgot about these. In the ggml q4_0 files for llama2, the norm weights are all stored in fp32, i.e. the ones corresponding to rms_att_weight, rms_ffn_weight, and rms_final_weight in llama2.c.

If you look at the table in https://huggingface.co/TheBloke/Llama-2-13B-GGML#provided-files, you can see that there are various ways to use different quantization types for different weights. ggerganov/llama.cpp#1684 is the PR that introduced the latest set of quantization setups for llama.cpp and contains lots of information about the choices made.

There are many other ways of doing quantization too. E.g. you can try to "calibrate" models by passing many batches through them and recording the activation ranges at all the layers at that time. These ranges are then used in the forward pass later, skipping the process of determining those ranges.

For people interested: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers has more info about such an approach.

@byte-6174
Contributor Author

byte-6174 commented Aug 14, 2023

I'm trying to understand, at a basic level, how it is done from the links you provided above. I understand that there are many complex mix-and-match strategies, like keeping certain layers at higher precision; however, at the most basic level, the quantization done in llama.cpp seems to be q4_0.
The code that actually does the q4_0 quantization is in ggml.c:

static void quantize_row_q4_0_reference(const float * restrict x, block_q4_0 * restrict y, int k) {
    static const int qk = QK4_0;
    assert(k % qk == 0);
    const int nb = k / qk;
    for (int i = 0; i < nb; i++) {
        float amax = 0.0f; // absolute max
        float max  = 0.0f;

        for (int j = 0; j < qk; j++) {
            const float v = x[i*qk + j];
            if (amax < fabsf(v)) {
                amax = fabsf(v);
                max  = v;
            }
        }
        const float d  = max / -8;   // scale chosen so the signed extreme maps to -8
        const float id = d ? 1.0f/d : 0.0f;
        y[i].d = GGML_FP32_TO_FP16(d);
        for (int j = 0; j < qk/2; ++j) {
            const float x0 = x[i*qk + 0    + j]*id;
            const float x1 = x[i*qk + qk/2 + j]*id;
            const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));   // shift by +8 into 0..15 and round
            const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
            y[i].qs[j]  = xi0;       // pack two 4-bit values per byte
            y[i].qs[j] |= xi1 << 4;
        }
        }
    }
}

This seems to do symmetric quantization in 4 bits (?).

I don't see any of the GPTQ-style Hessian computation on data, or the matrix inversion, described in the GPTQ paper above.
Can you point me to where that is happening?

@jrudolph
Contributor

jrudolph commented Aug 14, 2023 via email

@byte-6174
Contributor Author

got it, thanks..

@atamurad
Contributor

atamurad commented Aug 15, 2023

@byte-6174 thanks for linking to my branch!

I wanted to add some details/results so far, as my branch is a draft and not documented yet.

Code structure:

  • QMatrix - a new 2D data structure to represent quantized weights, plus a qmatmul function to perform matrix-vector multiplication. I think it helps with code readability, as the transformer() function requires no changes at all.
  • File format - I wanted to experiment with different quantization methods for different layers/weights within one file/model. QMatrix has a 4-byte tag to represent the quantization type - 'Q8_A', 'Q8_B', etc., inspired by https://en.wikipedia.org/wiki/FourCC.
  • quant.py - quantization code is implemented in Python and run.c only contains dequantization code. This keeps it consistent with the project structure - right now all models are converted/trained/exported from Python.
  • The code is not optimized for speed, as I focused on getting the quantization and model output right first.

Quantization methods:

  • Q8_A - block of 128 weights in a column. Only a scale (fp32) parameter and 128 x 8-bit ints per block.
  • Q8_B - block of 128 weights in a row. A scale (fp32) and a mean (fp32) parameter and 128 x 8-bit ints per block.
  • Q4_A - block of 256 weights in a row. Weights are sorted and split into 16 equal bins, each bin containing 16 elements. The bin mean values (16 x fp32) and the 256 4-bit indexes, packed into 128 bytes, are exported.
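
A rough C view of the three block layouts described above (field names and exact layouts are guesses, not taken from the branch):

#include <stdint.h>

typedef struct {            // 'Q8_A': 128 weights down a column, absmax scale only
    float   scale;
    int8_t  qs[128];
} block_q8_a;

typedef struct {            // 'Q8_B': 128 weights along a row, scale + mean
    float   scale;
    float   mean;
    int8_t  qs[128];
} block_q8_b;

typedef struct {            // 'Q4_A': 256 weights along a row, 16 bin means (a small codebook)
    float   bins[16];       // per-bin mean values
    uint8_t idx[128];       // 256 x 4-bit bin indexes, two per byte
} block_q4_a;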

Results summary

  • Stories110M model, Q4_A => 79MB model size, did not observe any degradation in output quality.
  • LLama2-7B-chat, Q8_A/Q8_B => 6.7GB model size, output is OK but slow as expected.
  • LLama2-7B-chat, Q4_A => 4.8GB model, didn't work, output is gibberish.

Sample output

./run llama2_7b_chat.q8_pooled -i "[INST] write a poem about math [/INST]" 
[INST] write a poem about math [/INST]  Sure! Here's a poem about MATH:

Math, the beat of life,
A rhythm so precise and true,
In every line, a code unbroken,
 Numbers that flow, like a river's tide,
Geometry of life, a equation so grand,
Squared, the truth unveiled,
The beauty of numbers, a cosmic, 

[truncated]

@byte-6174
Contributor Author

@atamurad:

LLama2-7B-chat, Q8_A/Q8_B => 6.7GB model size, output is OK but slow as expected.

slower than the float32 model?

@kroggen
Contributor

kroggen commented Aug 16, 2023

Just a reminder that FlashAttention also reduces the amount of memory required.

It does not need the intermediate attention matrix for each head; instead it computes by tiles and applies a trick to the softmax (computed in chunks, then normalized with correction factors).

This could be the right project for a simple implementation.
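
As a rough single-query sketch of that softmax trick in C (not FlashAttention itself, which additionally tiles to fit GPU SRAM; the k/v layout here is an assumption), attention over the cached positions can be accumulated online with a running max and normalizer, so the full attention row is never stored:

#include <math.h>

// q: query (head_size), k/v: cached keys/values laid out as [t * head_size + i].
void attention_online(const float* q, const float* k, const float* v,
                      float* out, int pos, int head_size) {
    float m = -1e30f;                        // running max of the scores
    float l = 0.0f;                          // running sum of exp(score - m)
    for (int i = 0; i < head_size; i++) out[i] = 0.0f;

    for (int t = 0; t <= pos; t++) {
        float s = 0.0f;                      // score = q . k_t / sqrt(head_size)
        for (int i = 0; i < head_size; i++) s += q[i] * k[t * head_size + i];
        s /= sqrtf((float)head_size);

        float m_new = s > m ? s : m;
        float corr  = expf(m - m_new);       // rescales previously accumulated terms
        float p     = expf(s - m_new);       // weight of the new position
        l = l * corr + p;
        for (int i = 0; i < head_size; i++)
            out[i] = out[i] * corr + p * v[t * head_size + i];
        m = m_new;
    }
    for (int i = 0; i < head_size; i++) out[i] /= l;   // final normalization
}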

@jrudolph
Contributor

jrudolph commented Aug 16, 2023 via email

@RahulSChand
Contributor

@jrudolph Yes, correct. Also, FlashAttention works by reducing I/O between GPU HBM and SRAM, which doesn't apply here since this is CPU-only inference.

The author of the paper pointed it out here: Dao-AILab/flash-attention#59

@mgrabban

(quoting @atamurad's comment above on the code structure, quantization methods, and results)

@atamurad
Is the following line in quantize_q8_a() correct?

scales[i][j] = np.max(np.abs(m[i:i+QK, j]))

I think it should be

scales[i][j] = np.max(np.abs(m[i*QK:(i+1)*QK, j]))

Maybe I am referring to the wrong repo?

@atamurad
Contributor

@mgrabban good catch, thank you!

I was wondering why Q8_A wasn't working for weights other than WQ, WK, WV, WO, so I switched those weights to Q8_B.

It probably worked for WQ, WK, WV, WO because those all share almost the same (or very close) max values across the grouped rows.

@atamurad
Contributor

I have another data point to add: I had some success running a 4-bit quantized Llama2-7B-chat model with run.c.

The speedup is 10x compared to FP32 weights. 4-bit model file size: 4.3GB.

Quantization is based on AWQ.

Activations are in FP32, so only the matmul has changed in run.c. I used AVX2 for dequantization + matrix multiplication. For 32-bit weights (only the final logit classifier), I also use #269
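
For illustration, here is a scalar (non-AVX2) sketch of an fp32-activation x 4-bit-weight matvec; the group size, nibble packing order, and the scale/zero convention are assumptions, not what the int4-avx2 branch actually does:

#include <stddef.h>
#include <stdint.h>

#define GROUP 128   // assumed quantization group size along the input dimension

// w: d x n weights, 4-bit, two per byte, row-major; scales/zeros: one pair per (row, group).
void matvec_q4(float* out, const float* x, const uint8_t* w,
               const float* scales, const float* zeros, int n, int d) {
    int groups_per_row = n / GROUP;
    for (int row = 0; row < d; row++) {
        float acc = 0.0f;
        const uint8_t* wr = w + (size_t)row * (n / 2);
        for (int j = 0; j < n; j += 2) {
            int g   = j / GROUP;
            float s = scales[row * groups_per_row + g];
            float z = zeros[row * groups_per_row + g];
            uint8_t packed = wr[j / 2];
            float w0 = ((float)(packed & 0x0F) - z) * s;   // low nibble, dequantized
            float w1 = ((float)(packed >> 4)   - z) * s;   // high nibble, dequantized
            acc += w0 * x[j] + w1 * x[j + 1];
        }
        out[row] = acc;
    }
}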

repo: https://github.com/atamurad/llama2.c/tree/int4-avx2

Issues: long prompts/generation is affected by this bug in HF export script: #286 (comment)

@pluto-llf

@atamurad Hello, I failed to export the model when using export_awq.py. The error is "KeyError: 'model.layers.0.mlp.gate_proj.qweight'". As I'm new to this process, I was wondering if you could provide any suggestions? Or could it be that the script is not up to date?
