
int8 refactor #383

Merged: 3 commits merged into karpathy:feature/int8_try2 on Sep 5, 2023
Conversation

@atamurad (Contributor) commented on Sep 5, 2023:

This is a refactor PR to be merged into draft PR #364 (branch feature/int8_try2).

Summary of changes:

  • matmul(), quantize() and dequantize() all take QuantizedTensor * as arguments (a rough sketch of such signatures follows this list).
  • Cleaned up memory_map_weights().
  • Code compiles and runs the same as before the refactor.
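
For orientation, a minimal sketch of what such QuantizedTensor-based signatures could look like (the struct layout and parameter names here are assumptions for illustration, not copied verbatim from the PR):

typedef struct {
    int8_t* q; // quantized values
    float* s;  // scaling factors, one per quantization group
} QuantizedTensor;

// quantize n floats from x into qx
void quantize(QuantizedTensor *qx, float *x, int n);
// expand the n quantized values in qx back into floats in x
void dequantize(QuantizedTensor *qx, float *x, int n);
// xout (d,) = w (d,n) @ x (n,), with both operands already quantized
void matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d);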

@xefoci7612 commented on Sep 5, 2023:

@atamurad maybe we can go the extra mile and also add the size:

typedef struct {
    int8_t* q;    // quantized values
    float* s; // scaling factors
    int n;  // tensor size
} QuantizedTensor;

I saw in my repo that this simplifies function signatures and the related code even further.

EDIT: BTW, last week I ported your patch to my repo and it works very well, thank you. For my code, splitting QuantizedTensor per layer instead of keeping the same layout (for instance for rms_att_weight) resulted in some added code machinery. Not a big thing of course, but since this layout will probably remain, I wanted to report it while the format is not yet finalized.

@karpathy (Owner) commented on Sep 5, 2023:

@xefoci7612 one reason to potentially not add n is that it ruins cache alignment? Without it the data might be cache aligned (?)

Review thread on the following diff:

@@ -38,23 +38,25 @@ typedef struct {

 typedef struct {
     // token embedding table
-    QuantizedTensor token_embedding_table; // (vocab_size, dim)
+    QuantizedTensor *q_tokens; // (vocab_size, dim)
+    float* token_embedding_table; // same, but dequantized
karpathy (Owner):

Is it better to just dequantize embeddings on demand as in my original PR? I don't super love that this way we're adding a lot of memory and latency.

karpathy (Owner):
(Startup latency, I mean, even though the per-token latency would be improved by a tiny amount.)

atamurad (Contributor, Author):

Copying my comment from the previous PR:

In this abstraction, QuantizedTensor is atomic: we do not index into or slice from it; we can only matmul with it or dequantize all of its elements. The token embedding (which gets sliced) and the shared classifier weights (which get matmul-ed) break this abstraction. So I split the token embedding into separate rows and also exported wcls as one QuantizedTensor (as if it were not shared).

So the options I could think of are:

  1. Partially dequantize the QuantizedTensor and break the current abstraction of it being opaque/atomic.
  2. Export each row/token embedding as a separate QuantizedTensor - this turned out to be a bit weird, as we had to re-export wcls as one QuantizedTensor for the final matmul, thus breaking weight sharing (this commit: f850a97).
  3. Current approach in this PR - dequantize all at startup (a sketch follows below). I think this is the best of the options above.
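
A minimal sketch of option 3, building on the q_tokens / token_embedding_table fields from the diff above and the dequantize() signature sketched earlier (names and placement are assumptions for illustration, not the exact PR code):

// hypothetical helper, called once at model load time (requires <stdlib.h>):
// expand the quantized token embedding table into a plain float buffer,
// so the forward pass can copy embedding rows without touching int8 data
float* dequantize_token_table(QuantizedTensor *q_tokens, int vocab_size, int dim) {
    float* table = malloc((size_t)vocab_size * dim * sizeof(float));
    dequantize(q_tokens, table, vocab_size * dim); // fill table with dequantized floats
    return table;
}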

karpathy (Owner):

I'm not sure that I follow. Why can't we index into the QuantizedTensor at the correct row? This should work just fine as long as dim % GS == 0, which is enforced via an assert in the Python export code?

atamurad (Contributor, Author):

We can index, but the goal was to avoid indexing into it from inside the forward() function (in case we want to add support for other quantization methods/techniques, e.g. int4).
We could implement a separate quantized_get_row() or something like that to decouple quantization from the transformer/forward() code.
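
A sketch of what such a helper could look like (hypothetical, not part of this PR; it assumes a global group size GS with one scale per group, so element i of a flat tensor uses scale s[i / GS]):

// dequantize one row of a 2D QuantizedTensor (n_rows x n_cols) into out,
// keeping group sizes and scale layout out of the transformer/forward() code
void quantized_get_row(QuantizedTensor *t, int row, float *out, int n_cols) {
    for (int i = 0; i < n_cols; i++) {
        int idx = row * n_cols + i;           // flat index into the tensor
        out[i] = t->q[idx] * t->s[idx / GS];  // int8 value times its group's scale
    }
}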

karpathy (Owner):

Not sure I fully appreciate the int4 issue, but maybe I haven't stared at this enough and you're ahead of me. I think I'll merge this for now as it is quite nice.

@atamurad (Contributor, Author) commented on Sep 5, 2023:

> @xefoci7612 one reason to potentially not add n is that it ruins cache alignment? Without it the data might be cache aligned (?)

The reason I didn't add it initially is that simply adding n is not enough. We use QuantizedTensor interchangeably as a 1D array (activations) and as a 2D array (weights).

Adding a full shape array would be overcomplicating things, since we never need more than 2 dimensions. Simply adding N (rows) and M (cols) might be a good balance.

I don't see cache alignment as a major issue as we can pad the header.
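
For reference, the shape-carrying variant being discussed could look like this (hypothetical; this is not what was merged):

typedef struct {
    int8_t* q; // quantized values
    float* s;  // scaling factors
    int N;     // rows (1 for 1D activations)
    int M;     // cols (elements per row)
} QuantizedTensor;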

@karpathy merged commit 5186b50 into karpathy:feature/int8_try2 on Sep 5, 2023. 6 checks passed.
vinhtran2611 pushed a commit to vinhtran2611/llama2.c that referenced this pull request on Jan 20, 2024.