-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
quantization_cpu base version #1190
base: main
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tombawor . Great first PR! Thank you and welcome!
I added a few comments for discussion.
Thanks again for taking the initiative on this and digging up the quantization code.
thunder/tests/test_quantization.py
Outdated
@@ -0,0 +1,81 @@ | |||
import torch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome to have comprehensive tests!
I think these could go into test_transforms (where there are also some tests that seem to need an update).
thunder/tests/test_quantization.py
Outdated
), "Quantized tensor should have fewer or equal elements due to compression" | ||
|
||
|
||
# Optional: Performance tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, I think these are cool to demonstrate the value, but tests are maybe not the best place for it.
If we want to show the performance, maybe we could create a quantization notebook or so? Similar to the "what's going on under the hood in fsdp".
thunder/transforms/quantization.py
Outdated
w_work = torch.zeros_like(w, device="cuda") | ||
elif w.device.type != "cuda": | ||
num_elements = w.numel() | ||
return torch.empty((num_elements, 1), device="meta", dtype=torch.uint8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the formula to computue the size is not quite right?
Also, the quantize_weight function needs to return both the quantized weight and a quantization state.
Maybe it would be possible to use (or adapt) the CPU code in impl to handle meta?
thunder/transforms/quantization.py
Outdated
# if future use cases require more flexibility, such as further model training or analysis | ||
# of quantization effects on the CPU. | ||
if w.device.type == "cpu": | ||
return quantize_4bit_impl(w, quant_type="nf4")[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want both the quantized weight and quantization state.
@@ -0,0 +1,224 @@ | |||
# NOTE: The code for CPU quantization in this file has been adapted from a not-yet-merged branch of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add the original copyright notice and a link to the license?
(also, referring to permalinks instead of branches is a bit more reliable)
Hey, so something seems still up.
|
The multi-backend refactor branch is still not included in the latest release of bitsandbytes as of version 0.44.x.
|
@tombawor I hope your year started well! It seems that this fallen a bit off the radar, are you still interested in getting it to merging? |
Yes, I'm interested. |
Great! Can't blame you for that. :) Give me a shout if you need anything. |
Before submitting
What does this PR do?
This PR addresses Issue #1111 by introducing an implementation of CPU-based 4-bit quantization in
quantization_cpu.py
. The code is adapted from a not-yet-merged branch of thebitsandbytes
library. Key highlights include:quantize_weight
for CPU without returning the quantization state to optimize for inference.PR review
Anyone in the community is welcome to review this PR once all tests have passed. Given the connection to issue #1111, feedback and suggestions for improvements are appreciated. If the PR hasn't been discussed in the issue, please reach out for context to improve its chances of merging.
Did you have fun?
Absolutely 🙃