quantization_cpu base version #1190

Open: wants to merge 16 commits into base: main

Conversation

tombawor

Before submitting

What does this PR do?

This PR addresses Issue #1111 by introducing an implementation of CPU-based 4-bit quantization in quantization_cpu.py. The code is adapted from a not-yet-merged branch of the bitsandbytes library. Key highlights include:

  • Implementing quantize_weight for CPU without returning the quantization state to optimize for inference.
  • A note has been added in the code to revisit returning the quantization state if fine-tuning or dequantization becomes necessary in future use cases.

PR review

Anyone in the community is welcome to review this PR once all tests have passed. Given the connection to issue #1111, feedback and suggestions for improvements are appreciated. If the PR hasn't been discussed in the issue, please reach out for context to improve its chances of merging.

Did you have fun?

Absolutely 🙃

Collaborator

@t-vi t-vi left a comment

Hi @tombawor . Great first PR! Thank you and welcome!

I added a few comments for discussion.

Thanks again for taking the initiative on this and digging up the quantization code.

@@ -0,0 +1,81 @@
import torch
Collaborator

Awesome to have comprehensive tests!
I think these could go into test_transforms (where there are also some tests that seem to need an update).

), "Quantized tensor should have fewer or equal elements due to compression"


# Optional: Performance tests
Collaborator

Hey, I think these are cool to demonstrate the value, but tests are maybe not the best place for it.
If we want to show the performance, maybe we could create a quantization notebook or so? Similar to the "what's going on under the hood in fsdp".

        w_work = torch.zeros_like(w, device="cuda")
    elif w.device.type != "cuda":
        num_elements = w.numel()
        return torch.empty((num_elements, 1), device="meta", dtype=torch.uint8)
Collaborator

I think the formula to compute the size is not quite right?
Also, the quantize_weight function needs to return both the quantized weight and a quantization state.
Maybe it would be possible to use (or adapt) the CPU code in impl to handle meta?
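
To make the size question concrete, here is a minimal sketch of the shape arithmetic, not the PR's code. It assumes two 4-bit codes are packed into each uint8 byte and that one absmax scale is stored per block with a block size of 64. The helper name `packed_4bit_shapes` is hypothetical, and the layout should be checked against the CUDA path before relying on it.

```python
def packed_4bit_shapes(num_elements, blocksize=64):
    """Return (packed_shape, absmax_shape) for blockwise 4-bit quantization.

    Assumptions (not taken from the PR): uint8 storage with two 4-bit codes
    per byte, a 2D [N, 1] packed tensor, and one absmax scale per block.
    """
    # Two codes share one byte, so round up when the count is odd.
    packed_rows = (num_elements + 1) // 2
    # One scale per block; a partial trailing block still needs a scale.
    num_blocks = (num_elements + blocksize - 1) // blocksize
    return (packed_rows, 1), (num_blocks,)


if __name__ == "__main__":
    # A 4096 x 4096 weight has 16,777,216 elements.
    print(packed_4bit_shapes(4096 * 4096))
```

Under these assumptions the packed tensor has about half as many rows as `w.numel()`, rather than `w.numel()` rows as in the snippet above. A meta-device branch could allocate empty tensors of these two shapes without touching real data.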

# if future use cases require more flexibility, such as further model training or analysis
# of quantization effects on the CPU.
if w.device.type == "cpu":
    return quantize_4bit_impl(w, quant_type="nf4")[0]
Collaborator

I think we want both the quantized weight and quantization state.
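
As an illustration of why the state matters, the following is a small pure-Python sketch of blockwise absmax quantization that returns both the packed codes and a state. It uses a uniform 16-level grid rather than NF4, and the names `quantize_4bit_uniform` and `dequantize_4bit_uniform` are hypothetical, not functions from bitsandbytes or this PR.

```python
def quantize_4bit_uniform(values, blocksize=4):
    """Blockwise absmax quantization to 4-bit codes (uniform grid, NOT NF4).

    Returns (packed, state): packed holds two codes per byte, and state
    carries everything dequantization needs.
    """
    codes, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block) or 1.0  # all-zero block: avoid dividing by 0
        absmax.append(scale)
        # Map [-scale, scale] onto the 16 integer levels 0..15.
        codes.extend(round((v / scale + 1.0) * 7.5) for v in block)
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up into whole bytes
    packed = [(hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2])]
    state = {"absmax": absmax, "blocksize": blocksize, "numel": len(values)}
    return packed, state


def dequantize_4bit_uniform(packed, state):
    """Invert quantize_4bit_uniform up to rounding error, using the state."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    codes = codes[:state["numel"]]  # drop the padding code, if any
    return [
        (code / 7.5 - 1.0) * state["absmax"][i // state["blocksize"]]
        for i, code in enumerate(codes)
    ]
```

Without the per-block scales, block size, and original element count, the packed bytes cannot be mapped back to weights, so a function that returns only element `[0]` of the result leaves the caller unable to dequantize.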

@@ -0,0 +1,224 @@
# NOTE: The code for CPU quantization in this file has been adapted from a not-yet-merged branch of the
Collaborator

Can you please add the original copyright notice and a link to the license?
(also, referring to permalinks instead of branches is a bit more reliable)

@t-vi
Collaborator

t-vi commented Oct 18, 2024

Hey, so something seems still up.
But there is some progress outside:

@tombawor
Author

The multi-backend refactor branch is still not included in the latest release of bitsandbytes as of version 0.44.x.
The latest changes in this PR include:

  • absmax dtype consistency: on CUDA, absmax is torch.float32.
  • Quantized out tensor dimension consistency: on CUDA, the out tensor is 2D with shape [N, 1].
    With the above changes, test_transforms.py::test_quantization_on_meta PASSED on the version before the last change, which moved the litgpt imports into the tests.

@t-vi
Collaborator

t-vi commented Jan 21, 2025

@tombawor I hope your year started well!

It seems that this has fallen a bit off the radar. Are you still interested in getting it merged?

@tombawor
Author

Yes, I'm interested.
I will refresh that this week.
I was lost in materials like Build a Large Language Model (From Scratch) by Sebastian Raschka.

@t-vi
Collaborator

t-vi commented Jan 21, 2025

Great! Can't blame you for that. :) Give me a shout if you need anything.

Development

Successfully merging this pull request may close these issues.

quantization: process tensors on meta device directly, maybe implement CPU quantization (if it is easy)
2 participants