
Cannot Replicate Reported GSM8K-CoT Results from HF Model Using GPTQModel Codebase #1560





Open
Eijnewgnaw opened this issue Apr 25, 2025 · 4 comments


@Eijnewgnaw

Eijnewgnaw commented Apr 25, 2025

Hi,

First, I want to commend you on the incredible work you've done with GPTQModel. It's a truly innovative approach, and I'm very excited about its potential.

I’ve been working with the Meta-Llama-3.2-1B-Instruct model from the HuggingFace page below, and I’m having some issues replicating the reported results for GSM8K-CoT:

https://huggingface.co/ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5

I am quantizing the model from Hugging Face with your codebase as outlined in the README, using exactly the same hyperparameters as those listed in the link above. Despite this, I'm seeing a significant performance gap (~20%) on GSM8K-CoT. The difference is not isolated to GSM8K: it also appears on HumanEval and ARC-Challenge, where performance drops by several percentage points.

What I did

Model: Meta-Llama-3.2-1B-Instruct from HuggingFace

Quantization: GPTQ with bits=4, group_size=32, desc_act=True, static_groups=True, following the author's README example (see the sketch after this list)

Kernel: Auto-selected MarlinQuantLinear

Evaluation: lm-eval with task gsm8k_cot_llama

Calibration datasets: tested both wikitext2 and c4

Sampling settings:

do_sample=False

temperature=None

top_p=None
(these were set to ensure reproducibility and reduce sampling variance — the gap persists even with recommended settings)
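
For completeness, here is roughly the quantization script I ran. This is a minimal sketch, assuming the README-style GPTQModel API (`QuantizeConfig`, `GPTQModel.load`, `model.quantize`, `model.save`); the exact names may differ between versions, and the calibration sample count is my own choice rather than anything from the model card.

```python
# Minimal sketch of my quantization run (hyperparameters as listed above).
# Assumes the README-style GPTQModel API; names may differ between versions.
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

base_model = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

quant_config = QuantizeConfig(
    bits=4,            # 4-bit weights
    group_size=32,     # same as the HF model card
    desc_act=True,
    static_groups=True,
)

# Calibration data: I tested both wikitext2 and c4; wikitext2 is shown here.
# The sample count (1024 rows) is my choice, not taken from the card.
calibration = [
    row["text"]
    for row in load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    if row["text"].strip()
][:1024]

model = GPTQModel.load(base_model, quant_config)
model.quantize(calibration)
model.save(quant_path)
```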

My questions

Was this model quantized after instruction tuning or CoT fine-tuning?

Was any special prompt formatting or chat template applied during the evaluation phase?

Are there any internal differences that might explain the ~20% performance gap I’m seeing on GSM8K-CoT, and the similar drops on other tasks like HumanEval / ARC-Challenge?

Reason for urgency

I am planning to build on your codebase and make some modifications to explore new research directions, so I am eager to resolve this issue as soon as possible and continue with the next steps of my work.

I would greatly appreciate your guidance on what might be causing this discrepancy; a timely response would be very valuable for my ongoing project.

Thank you so much for your time and for sharing such an impactful codebase!

Best regards,
A struggling graduate student

@Qubitium
Collaborator

@Eijnewgnaw Please note that the model you posted was lm-eval tested using vLLM. You should test using the same inference tool, as different tooling will cause lm-eval differences.

Secondly, are you unable to reproduce using your own quantized model, or using the actual model weights from our HF model?

https://huggingface.co/ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5


@Eijnewgnaw
Author

Eijnewgnaw commented Apr 28, 2025

@Qubitium
eval_Llama-3.2-1B-Instruct-gptqmodel-4bit_20250428_120202.log

quant.log

Thanks for your reply!

I understand that you used vLLM for the official evaluation.
In my case, I initially tried vLLM as backend but encountered CUDA compatibility issues (no kernel image available), so I temporarily switched to torch-only inference (backend='auto').
From my understanding, using vLLM mainly affects inference speed, but should not significantly affect evaluation accuracy on tasks like GSM8K-CoT.

Regarding the model used:
I used only the official Hugging Face model Llama-3.2-1B-Instruct (the same base model you referenced) for quantization, and did not use any other sources.
The quantization process followed standard GPTQ settings (4-bit, group size 32), and I have attached the full quantization log (quant.log) for reference to ensure transparency and reproducibility.

As next steps:

I will set up a compatible vLLM environment and re-run the evaluation using vLLM as the backend to fully align with your setup (see the sketch after this list).

Meanwhile, I would really appreciate it if you could help analyze possible reasons why my evaluation results are significantly lower than expected, based on the attached logs (quant.log and eval.log).

I want to make sure there is no hidden issue in the quantization or evaluation configuration that could explain the performance gap.
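
For the vLLM re-run mentioned above, I am planning something along these lines. This is only a sketch: the model_args, the use of apply_chat_template/fewshot_as_multiturn, and greedy decoding are my assumptions about your setup, and the argument names follow the lm-eval 0.4.x Python API.

```python
# Sketch of the planned vLLM-backed re-run via lm-eval's Python API.
# Everything here reflects my assumptions about the official setup, not
# confirmed settings from the model card.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5,"
        "dtype=auto,gpu_memory_utilization=0.85"
    ),
    tasks=["gsm8k_cot_llama"],
    num_fewshot=8,
    batch_size="auto",
    apply_chat_template=True,      # the chat-template run below scores higher
    fewshot_as_multiturn=True,     # assumption: few-shot turns formatted as chat
    gen_kwargs="temperature=0.0",  # greedy decoding to remove sampling variance
)
print(results["results"]["gsm8k_cot_llama"])
```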

Thanks again for your support!

@Eijnewgnaw
Author

@Qubitium
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO:root:Running evaluation on LM_EVAL.GSM8K_COT...
INFO Eval: loading using backend = auto
from_quantized: adapter: None
INFO Loader: Auto dtype (native bfloat16): torch.bfloat16
INFO Estimated Quantization BPW (bits per weight): 4.85 bpw, based on [bits: 4, group_size: 32]
INFO Kernel: Auto-selection: adding candidate TorchQuantLinear
INFO Kernel: candidates -> [TorchQuantLinear]
INFO Kernel: selected -> TorchQuantLinear.
WARNING:accelerate.utils.modeling:The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
INFO Format: Converting checkpoint_format from FORMAT.GPTQ to internal FORMAT.GPTQ_V2.
INFO Format: Converting GPTQ v1 to v2
INFO Format: Conversion complete: 0.009551763534545898s
INFO Kernel: Auto-selection: adding candidate TorchQuantLinear
INFO Optimize: TorchQuantLinear compilation triggered.
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=128004 (token='<|finetune_right_pad_id|>').
INFO Model: Loaded generation_config: GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

INFO Model: Auto-fixed generation_config mismatch between model and generation_config.json.
INFO Model: Updated generation_config: GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9
}

INFO Kernel: loaded -> [TorchQuantLinear]
WARNING:lm_eval.models.huggingface:pretrained model kwarg is not of type str. Many other model arguments may be ignored. Please do not launch via accelerate or use parallelize=True if passing an existing model this way.
WARNING:lm_eval.models.huggingface:Passed an already-initialized model through pretrained, assuming single-process call to evaluate() or custom distributed integration
INFO LM-EVAL: gen_kwargs = do_sample=True,temperature=0.6,top_k=50,top_p=0.9
INFO LM-EVAL: apply_chat_template = False
INFO:lm_eval.evaluator:Setting random seed to 1234 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
WARNING:lm_eval.evaluator:generation_kwargs: {'do_sample': True, 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9} specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
INFO:lm_eval.evaluator:Using pre-initialized model
README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.94k/7.94k [00:00<00:00, 72.1MB/s]
train-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.31M/2.31M [00:00<00:00, 6.93MB/s]
test-00000-of-00001.parquet: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419k/419k [00:00<00:00, 5.60MB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7473/7473 [00:00<00:00, 264524.47 examples/s]
Generating test split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 483972.27 examples/s]
INFO:lm_eval.evaluator:gsm8k_cot: Using gen_kwargs: {'do_sample': True, 'until': ['Q:', '', '<|im_end|>'], 'temperature': 0.6, 'top_k': 50, 'top_p': 0.9}
INFO:lm_eval.api.task:Building contexts for gsm8k_cot on rank 0...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:09<00:00, 141.81it/s]
INFO:lm_eval.evaluator:Running generate_until requests
Running generate_until requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [35:24<00:00, 1.61s/it]
--------lm_eval Eval Result---------

| Tasks     | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-----------|---------|------------------|--------|-------------|--------|---|--------|
| gsm8k_cot | 3       | flexible-extract | 8      | exact_match | 0.0174 | ± | 0.0036 |
|           |         | strict-match     | 8      | exact_match | 0.0174 | ± | 0.0036 |

@Eijnewgnaw
Author

@Qubitium
And if I use the chat template, the score improves, but it is still far from the reported result:
INFO ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.
INFO:root:Running evaluation on LM_EVAL.GSM8K_COT...
INFO Eval: loading using backend = auto
from_quantized: adapter: None
INFO Loader: Auto dtype (native bfloat16): torch.bfloat16
INFO Estimated Quantization BPW (bits per weight): 4.85 bpw, based on [bits: 4, group_size: 32]
INFO Kernel: Auto-selection: adding candidate TorchQuantLinear
INFO Kernel: candidates -> [TorchQuantLinear]
INFO Kernel: selected -> TorchQuantLinear.
WARNING:accelerate.utils.modeling:The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
INFO Format: Converting checkpoint_format from FORMAT.GPTQ to internal FORMAT.GPTQ_V2.
INFO Format: Converting GPTQ v1 to v2
INFO Format: Conversion complete: 0.01409006118774414s
INFO Kernel: Auto-selection: adding candidate TorchQuantLinear
INFO Optimize: TorchQuantLinear compilation triggered.
INFO:tokenicer.tokenicer:Tokenicer: Auto fixed pad_token_id=128004 (token='<|finetune_right_pad_id|>').
INFO Model: Loaded generation_config: GenerationConfig {
"bos_token_id": 128000,
"eos_token_id": [
128001,
128008,
128009
]
}

INFO Model: Auto-fixed generation_config mismatch between model and generation_config.json.
INFO Model: Updated generation_config: GenerationConfig {
"bos_token_id": 128000,
"do_sample": true,
"eos_token_id": [
128001,
128008,
128009
],
"temperature": 0.6,
"top_p": 0.9
}

INFO Kernel: loaded -> [TorchQuantLinear]
INFO 05-19 10:36:00 [init.py:248] Automatically detected platform cuda.
WARNING:lm_eval.models.huggingface:pretrained model kwarg is not of type str. Many other model arguments may be ignored. Please do not launch via accelerate or use parallelize=True if passing an existing model this way.
WARNING:lm_eval.models.huggingface:Passed an already-initialized model through pretrained, assuming single-process call to evaluate() or custom distributed integration
INFO LM-EVAL: gen_kwargs = do_sample=True,temperature=0.6,top_k=50,top_p=0.9
INFO LM-EVAL: apply_chat_template = True
INFO:lm_eval.evaluator:Setting random seed to 1234 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
WARNING:lm_eval.evaluator:generation_kwargs specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
INFO:lm_eval.evaluator:Using pre-initialized model
WARNING:lm_eval.evaluator:Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
INFO:lm_eval.api.task:Building contexts for gsm8k_cot on rank 0...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:08<00:00, 158.38it/s]
INFO:lm_eval.evaluator:Running generate_until requests
Running generate_until requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [47:07<00:00, 2.14s/it]
--------lm_eval Eval Result---------

| Tasks     | Version | Filter           | n-shot | Metric      | Value  |   | Stderr |
|-----------|---------|------------------|--------|-------------|--------|---|--------|
| gsm8k_cot | 3       | flexible-extract | 8      | exact_match | 0.1895 | ± | 0.0108 |
|           |         | strict-match     | 8      | exact_match | 0.0751 | ± | 0.0073 |

--------lm_eval Result End---------
