int8 refactor #383
Conversation
@atamurad maybe we can go the extra mile and also add the size:

```c
typedef struct {
    int8_t* q; // quantized values
    float* s;  // scaling factors
    int n;     // tensor size
} QuantizedTensor;
```

I saw in my repo that this simplifies the function signatures and the related code even further.

EDIT: BTW, last week I ported your patch to my repo and it works very well, thank you. In my code, splitting QuantizedTensor per layer instead of keeping the same layout (for instance for rms_att_weight) resulted in some added code machinery. Not a big thing of course, but because this layout will probably remain, I'd like to report this while the format is still not finalized.
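For illustration only, a minimal sketch of how carrying the size inside the struct can shorten call sites. The `dequantize()` signature and the fixed `GS` group size here are assumptions for the example, not taken from this PR:

```c
#include <stdint.h>

#define GS 64  /* quantization group size (example value, not from this PR) */

typedef struct {     /* struct as suggested above */
    int8_t* q;  /* quantized values */
    float* s;   /* scaling factors */
    int n;      /* tensor size (number of elements) */
} QuantizedTensor;

/* With n stored in the struct, callers no longer pass the length explicitly.
   Hypothetical signature, shown only to illustrate the simpler call sites. */
void dequantize(const QuantizedTensor *qx, float *x) {
    for (int i = 0; i < qx->n; i++) {
        x[i] = qx->q[i] * qx->s[i / GS];  /* undo group-wise scaling */
    }
}
```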
@xefoci7612 one reason to potentially not add n is that it ruins cache alignment? Without it the data might be cache aligned (?)
```diff
@@ -38,23 +38,25 @@ typedef struct {

 typedef struct {
     // token embedding table
-    QuantizedTensor token_embedding_table; // (vocab_size, dim)
+    QuantizedTensor *q_tokens; // (vocab_size, dim)
+    float* token_embedding_table; // same, but dequantized
```
Is it better to just dequantize embeddings on demand as in my original PR? I don't super love that this way we're adding a lot of memory and latency.
(Startup latency, I mean, even though the per-token latency would be improved by a tiny amount.)
Copying my comment from the previous PR:

In this abstraction, QuantizedTensor is atomic: we do not index into it or slice from it; we can only matmul with it or dequantize all of its elements. The token embedding (which is sliced) and the shared classifier weights (which are matmul-ed) break this model, so I split the token embedding into separate rows and exported wcls as its own QuantizedTensor as well (as if it were not shared).
The options I could think of are:
- Partially dequantize the QuantizedTensor and break the current abstraction of it being opaque/atomic.
- Export each row of the token embedding as a separate QuantizedTensor. This turned out to be a bit weird, as we had to re-export wcls as one QuantizedTensor for the final matmul, thus breaking weight sharing (this commit: f850a97).
- The current approach in this PR: dequantize everything at startup (a short sketch follows below). I think this is the best of the options above.
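For concreteness, a minimal sketch of that third option. It assumes the QuantizedTensor type and the dequantize(qx, x, n) helper from this PR; build_token_embedding_table is a hypothetical name used only for illustration:

```c
#include <stdlib.h>

/* Hypothetical helper: dequantize the full (vocab_size, dim) token embedding
   table once at startup, so forward() reads embeddings as plain floats and
   never slices into a QuantizedTensor. */
float* build_token_embedding_table(QuantizedTensor *q_tokens, int vocab_size, int dim) {
    float *table = malloc((size_t)vocab_size * dim * sizeof(float));
    dequantize(q_tokens, table, vocab_size * dim);  /* one-time startup cost */
    return table;
}
```

An embedding lookup in forward() is then just a copy from `table + token * dim`, at the cost of keeping `vocab_size * dim` extra floats in memory, which is the startup memory/latency trade-off discussed above.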
I'm not sure that I follow. Why can't we index into the QuantizedTensor at the correct row? This should work just fine as long as dim % GS == 0, which is the case via an assert in the Python export code?
We can index, but the goal was to avoid indexing into it from inside the forward() function (in case you want to add support for other quantization methods/techniques, i.e. int4). We could implement another quantized_get_row() or something like that separately to decouple quantization from the transformer/forward() code.
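If that route were taken, such a helper could look roughly like the following. This is a sketch against the two-field QuantizedTensor from this PR; quantized_get_row is the hypothetical name suggested above, and GS is the quantization group size:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { int8_t* q; float* s; } QuantizedTensor;  /* as in this PR */

/* Hypothetical helper: a view of one row of a quantized (n_rows, dim) matrix,
   so forward() never touches q/s directly. Only valid when dim % GS == 0,
   which the export script asserts. */
QuantizedTensor quantized_get_row(const QuantizedTensor *t, int row, int dim, int GS) {
    QuantizedTensor r;
    r.q = t->q + (size_t)row * dim;        /* this row's int8 values   */
    r.s = t->s + (size_t)row * dim / GS;   /* this row's scale factors */
    return r;
}
```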
Not sure if I appreciate the int4 issue but maybe I haven't stared at this enough and you're ahead of me. I think I'll merge this for now as it is quite nice.
Why I didn't add it initially: simply adding n is not enough. We use QuantizedTensor interchangeably as a 1D array (activations) and a 2D array (weights), so it would need the full shape rather than a single size. I don't see cache alignment as a major issue, as we can pad the header.
int8 refactor
This is a refactor PR to be merged into draft PR #364 (branch `feature/int8_try2`).

Summary of changes:
- `matmul()`, `quantize()` and `dequantize()` all take `QuantizedTensor *` as arguments.
- `memory_map_weights()`
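For orientation, the refactored signatures look roughly like the following; parameter names here are illustrative rather than copied verbatim from the PR:

```c
/* Illustrative post-refactor signatures (names may differ slightly in the PR). */
void quantize(QuantizedTensor *qx, float *x, int n);    /* float[n] -> int8 values + group scales */
void dequantize(QuantizedTensor *qx, float *x, int n);  /* int8 values + group scales -> float[n] */
void matmul(float *xout, QuantizedTensor *x, QuantizedTensor *w, int n, int d);  /* xout(d,) = W(d,n) @ x(n,) */
```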