[Benchmark] Reproduce GPTQv2 results #1545
Hi @eldarkurtic

That's the script I used, with the changes noted below, to create the test result in the README bench. I commented the 4 lines I changed; everything else is the same. Tests were performed on an A100.

class TestLlama3_2(ModelTest):
    NATIVE_MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # change
    NATIVE_ARC_CHALLENGE_ACC = 0.3567
    NATIVE_ARC_CHALLENGE_ACC_NORM = 0.3805
    QUANT_ARC_MAX_DELTA_FLOOR_PERCENT = 0.36
    APPLY_CHAT_TEMPLATE = True
    V2 = True  # change
    QUANTIZE_CONFIG_BITS = 4  # change
    TASK_NAME = [EVAL.LM_EVAL.ARC_CHALLENGE, EVAL.LM_EVAL.GSM8K_PLATINUM_COT]  # change

    def test_llama3_2(self):
        self.quant_lm_eval()

Let me know if you run into issues. Above is 100% of the script used, and you should be able to replicate.
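For completeness, a minimal sketch of the imports the class above assumes, based on GPTQModel's test-suite layout; the model_test module path is an assumption and may differ in your checkout:

# Assumed imports for the test class above (typically run via pytest from the tests directory).
from gptqmodel.utils.eval import EVAL  # task enum used in TASK_NAME
from model_test import ModelTest       # base harness providing quant_lm_eval(); location is assumed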
Thanks a lot, will give it a try. Do you by any chance have the models already available somewhere? Are GPTQv2 models runnable in vLLM?
@eldarkurtic I did not store the quantized models; I should have. Let me add this to my to-do. GPTQ v2 differs only in the quantization process and its output is 100% GPTQ compliant, so it will run on all kernels/inference engines that support GPTQ (v1), including vLLM and SGLang.
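For example, a GPTQ or GPTQv2 checkpoint can be loaded with vLLM's offline API; a minimal sketch, where the model path, prompt, and sampling parameters are placeholders:

from vllm import LLM, SamplingParams

# vLLM auto-detects the GPTQ quantization config from the checkpoint;
# the path below is a placeholder for a saved GPTQ/GPTQv2 model directory or HF repo id.
llm = LLM(model="/your_path/gptq_v2_llama_3.1_8B_Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)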
Any chance you could share this GPTQv2 W4g128 model?
Yes. I will requant and push the 4 models (3-bit/4-bit × v1/v2) to HF later today.
Env:
Reproduced models, quant script, and eval script: https://huggingface.co/ModelCloud/GPTQ-v1-Llama-3.1-8B-Instruct

I included updated benchmark results in the HF repos since this re-quantized model has slightly different results, but the huge difference between v1 and v2 remains. Quantization uses the aforementioned code with c4/en, 256 samples, group size 128. The exact code used to evaluate the model is below.
# eval
import tempfile

from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL
from lm_eval.tasks import TaskManager
from lm_eval.utils import make_table

# QUANT_SAVE_PATH is the checkpoint path produced by the quantization script below.
with tempfile.TemporaryDirectory() as tmp_dir:
    results = GPTQModel.eval(
        QUANT_SAVE_PATH,
        tasks=[EVAL.LM_EVAL.ARC_CHALLENGE, EVAL.LM_EVAL.GSM8K_PLATINUM_COT],
        apply_chat_template=True,
        random_seed=898,
        output_path=tmp_dir,
    )

    print(make_table(results))
    if "groups" in results:
        print(make_table(results, "groups"))

v1:
v2:
So v2 offers slightly higher accuracy when measuring only by log probability. Full quantization code below:

import tempfile
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.quantization import FORMAT
from gptqmodel.utils.eval import EVAL
from logbar import LogBar
log = LogBar.shared()
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
CFG_BITS = 4
CFG_GROUPSIZE = 128
CFG_V2 = True
INPUTS_MAX_LENGTH = 2048 # in tokens
QUANT_SAVE_PATH = f"/your_path/gptq_v2_{CFG_V2}_bit_{CFG_BITS}_gpsize_{CFG_GROUPSIZE}_llama_3.1_8B_Instruct"
def get_calib_data(tokenizer, rows: int):
    # calibration_dataset = load_dataset(
    #     "allenai/c4",
    #     data_files="en/c4-train.00000-of-01024.json.gz",
    #     split="train"
    # )
    calibration_dataset = load_dataset(
        "json",
        data_files="/your_path/dataset/c4-train.00000-of-01024.json.gz",
        split="train")

    datas = []
    for index, sample in enumerate(calibration_dataset):
        tokenized = tokenizer(sample["text"])
        if len(tokenized.data['input_ids']) <= INPUTS_MAX_LENGTH:
            datas.append(tokenized)

            if len(datas) >= rows:
                break

    return datas
quant_config = QuantizeConfig(
    bits=CFG_BITS,
    group_size=CFG_GROUPSIZE,
    format=FORMAT.GPTQ,
    desc_act=True,
    sym=True,
    v2=CFG_V2,
)
log.info(f"QuantConfig: {quant_config}")
log.info(f"Save Path: {QUANT_SAVE_PATH}")
# load un-quantized native model
model = GPTQModel.load(MODEL_ID, quant_config)
# load calibration data
calibration_dataset = get_calib_data(tokenizer=model.tokenizer, rows=256)
model.quantize(calibration_dataset, batch_size=1)
model.save(QUANT_SAVE_PATH)
log.info(f"Quant Model Saved to: {QUANT_SAVE_PATH}") Note: Both v1 and v2 quant and eval are produced via the above. |
Hi, I would like to reproduce GPTQv2 W4g128 evals shown in the README.
Could you help me by: