int8 refactor #383
Conversation
@atamurad maybe we can go the extra mile and also add the size:

```c
typedef struct {
    int8_t* q; // quantized values
    float* s;  // scaling factors
    int n;     // tensor size
} QuantizedTensor;
```

I saw in my repo that this simplifies the function signatures and the related code even further.

EDIT: BTW, last week I ported your patch to my repo and it works very well, thank you. In my code, splitting QuantizedTensor per layer instead of keeping the same layout (for instance for rms_att_weight) resulted in some added code machinery. Not a big thing of course, but because this layout will probably remain, I'd like to report this while the format is still not finalized.
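For illustration only, a minimal sketch of how carrying the size inside the struct can shorten call sites. The `dequantize()` signature and the fixed `GS` group size here are assumptions for the example, not taken from this PR:

```c
#include <stdint.h>

#define GS 64  /* quantization group size (example value, not from this PR) */

typedef struct {     /* struct as suggested above */
    int8_t* q;  /* quantized values */
    float* s;   /* scaling factors */
    int n;      /* tensor size (number of elements) */
} QuantizedTensor;

/* With n stored in the struct, callers no longer pass the length explicitly.
   Hypothetical signature, shown only to illustrate the simpler call sites. */
void dequantize(const QuantizedTensor *qx, float *x) {
    for (int i = 0; i < qx->n; i++) {
        x[i] = qx->q[i] * qx->s[i / GS];  /* undo group-wise scaling */
    }
}
```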
@xefoci7612 one reason to potentially not add n is that it ruins cache alignment? Without it the data might be cache aligned (?)
```diff
@@ -38,23 +38,25 @@ typedef struct {

 typedef struct {
     // token embedding table
-    QuantizedTensor token_embedding_table; // (vocab_size, dim)
+    QuantizedTensor *q_tokens; // (vocab_size, dim)
+    float* token_embedding_table; // same, but dequantized
```
Is it better to just dequantize embeddings on demand as in my original PR? I don't super love that this way we're adding a lot of memory and latency.
(Startup latency, I mean, even though the per-token latency would be improved by a tiny amount.)
Copying my comment from the previous PR:

In this abstraction, QuantizedTensor is atomic: we do not index into it or slice from it; we can only matmul with it or dequantize all of its elements. The token embedding (which is sliced) and the shared classifier weights (which are matmul-ed) break this model, so I split the token embedding into separate rows and exported wcls as its own QuantizedTensor as well (as if it were not shared).
The options I could think of are:
- Partially dequantize the QuantizedTensor and break the current abstraction of it being opaque/atomic.
- Export each row of the token embedding as a separate QuantizedTensor. This turned out to be a bit weird, as we had to re-export wcls as one QuantizedTensor for the final matmul, thus breaking weight sharing (this commit: f850a97).
- The current approach in this PR: dequantize everything at startup (a short sketch follows below). I think this is the best of the options above.
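For concreteness, a minimal sketch of that third option. It assumes the QuantizedTensor type and the dequantize(qx, x, n) helper from this PR; build_token_embedding_table is a hypothetical name used only for illustration:

```c
#include <stdlib.h>

/* Hypothetical helper: dequantize the full (vocab_size, dim) token embedding
   table once at startup, so forward() reads embeddings as plain floats and
   never slices into a QuantizedTensor. */
float* build_token_embedding_table(QuantizedTensor *q_tokens, int vocab_size, int dim) {
    float *table = malloc((size_t)vocab_size * dim * sizeof(float));
    dequantize(q_tokens, table, vocab_size * dim);  /* one-time startup cost */
    return table;
}
```

An embedding lookup in forward() is then just a copy from `table + token * dim`, at the cost of keeping `vocab_size * dim` extra floats in memory, which is the startup memory/latency trade-off discussed above.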
I'm not sure that I follow. Why can't we index into the QuantizedTensor at the correct row? This should work just fine as long as dim % GS == 0, which is the case via an assert in the Python export code?
We can index, but the goal was to avoid indexing into it from inside the forward() function (in case you want to add support for other quantization methods/techniques, i.e. int4). We could implement another quantized_get_row() or something like that separately to decouple quantization from the transformer/forward() code.
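If that route were taken, such a helper could look roughly like the following. This is a sketch against the two-field QuantizedTensor from this PR; quantized_get_row is the hypothetical name suggested above, and GS is the quantization group size:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { int8_t* q; float* s; } QuantizedTensor;  /* as in this PR */

/* Hypothetical helper: a view of one row of a quantized (n_rows, dim) matrix,
   so forward() never touches q/s directly. Only valid when dim % GS == 0,
   which the export script asserts. */
QuantizedTensor quantized_get_row(const QuantizedTensor *t, int row, int dim, int GS) {
    QuantizedTensor r;
    r.q = t->q + (size_t)row * dim;        /* this row's int8 values   */
    r.s = t->s + (size_t)row * dim / GS;   /* this row's scale factors */
    return r;
}
```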
Not sure if I appreciate the int4 issue but maybe I haven't stared at this enough and you're ahead of me. I think I'll merge this for now as it is quite nice.
Why I didn't add it initially: simply adding n is not enough. We use QuantizedTensor interchangeably as a 1D array (activations) and a 2D array (weights), so it would need the full shape rather than a single size. I don't see cache alignment as a major issue, as we can pad the header.
int8 refactor
This is a refactor PR to be merged into draft PR #364 (branch `feature/int8_try2`).

Summary of changes:
- `matmul()`, `quantize()` and `dequantize()` all take `QuantizedTensor *` as arguments.
- `memory_map_weights()`
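For orientation, the refactored signatures look roughly like the following; parameter names here are illustrative rather than copied verbatim from the PR:

```c
/* Illustrative post-refactor signatures (names may differ slightly in the PR). */
void quantize(QuantizedTensor *qx, float *x, int n);    /* float[n] -> int8 values + group scales */
void dequantize(QuantizedTensor *qx, float *x, int n);  /* int8 values + group scales -> float[n] */
void matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d);  /* xout(d,) = W(d,n) @ x(n,) */
```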