Cannot Replicate Reported GSM8K-CoT Results from HF Model Using GPTQModel Codebase #1560
Comments
@Eijnewgnaw Please note the model you posted was lm-eval tested using vLLM. You should test with the same inference tool, as different tooling will cause lm-eval differences. Secondly, are you unable to reproduce with your own quantized model, or with the actual model weights from our HF model? https://huggingface.co/ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5
@Qubitium Thanks for your reply! I understand that you used vLLM for the official evaluation. Regarding the model used: I quantized Meta-Llama-3.2-1B-Instruct myself with your codebase, as described in my original post. As a next step, I will set up a compatible vLLM environment and re-run the evaluation with vLLM as the backend to fully align with your setup (roughly along the lines sketched below). Meanwhile, I would really appreciate it if you could help analyze possible reasons for my evaluation results being significantly lower than expected, based on the logs (quant.log and eval.log) I provided. I want to make sure no hidden issue in my quantization or evaluation configuration caused the performance gap. Thanks again for your support!
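For reference, here is a minimal sketch of what such a vLLM-backend lm-eval run might look like; the model_args values, chat-template flags, and batch size are my assumptions, not a command confirmed by the maintainers:

```python
# Minimal sketch (assumed settings): evaluate the published GPTQ checkpoint on
# gsm8k_cot_llama through lm-eval's vLLM backend instead of the HF backend.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5,"
        "dtype=auto,gpu_memory_utilization=0.8,max_model_len=4096"
    ),
    tasks=["gsm8k_cot_llama"],
    apply_chat_template=True,   # gsm8k_cot_llama is built around the Llama chat format
    fewshot_as_multiturn=True,  # assumption: present few-shot examples as chat turns
    batch_size="auto",
)
print(results["results"]["gsm8k_cot_llama"])
```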
Hi,
Firstly, I want to commend you on the incredible work you've done with GPTQModel. It's a truly innovative approach, and I'm very excited about its potential.
I’ve been working with the Meta-Llama-3.2-1B-Instruct model from the HuggingFace page below, and I’m having some issues replicating the reported results for GSM8K-CoT:
https://huggingface.co/ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5
I am using the model from HuggingFace and quantizing it with the authors' codebase as outlined in the README, with exactly the same hyperparameters as those listed at the link above. Despite this, I'm seeing a significant performance gap (~20%) on GSM8K-CoT. The difference is not isolated to GSM8K: HumanEval and ARC-Challenge also drop by several percentage points.
What I did
Model: Meta-Llama-3.2-1B-Instruct from HuggingFace
Quantization: GPTQ with bits=4, group_size=32, desc_act=True, static_groups=True, following the author's README example for quantization (see the sketch after this list)
Kernel: Auto-selected MarlinQuantLinear
Evaluation: lm-eval with task gsm8k_cot_llama
Calibration datasets: tested both wikitext2 and c4
Sampling settings:
do_sample=False
temperature=None
top_p=None
(these were set to ensure reproducibility and reduce sampling variance — the gap persists even with recommended settings)
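To make the setup above concrete, here is a minimal sketch of my quantization run as I understand it from the README; the calibration slice, sample count, and save path are placeholders I chose, not values taken from the published model card:

```python
# Minimal sketch of my quantization setup (calibration file, sample count,
# and output path are placeholders, not the exact values used upstream).
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

# Calibration text: a slice of C4 (I also tried wikitext2), as in the README example.
calibration = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train",
).select(range(1024))["text"]

quant_config = QuantizeConfig(
    bits=4,
    group_size=32,
    desc_act=True,
    static_groups=True,
)

model = GPTQModel.load("meta-llama/Llama-3.2-1B-Instruct", quant_config)
model.quantize(calibration, batch_size=1)
model.save("Llama-3.2-1B-Instruct-gptq-4bit-g32")
```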
My questions
Was this model quantized after instruction tuning or CoT fine-tuning?
Was any special prompt formatting or chat template applied during the evaluation phase? (For what I mean by prompt formatting, see the small check sketched after these questions.)
Are there any internal differences that might explain the ~20% performance gap I’m seeing on GSM8K-CoT, and the similar drops on other tasks like HumanEval / ARC-Challenge?
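For context on question 2, this is a small, hypothetical check I used to see which chat template the tokenizer ships with; it only illustrates what I mean by "prompt formatting" and is not a claim about your evaluation pipeline:

```python
# Hypothetical check (not part of the official pipeline): print how the tokenizer
# formats a chat-style prompt, to see which template generation would go through.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5"
)
messages = [{"role": "user", "content": "What is 12 * 7? Think step by step."}]
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```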
Reason for urgency
I am planning to build on your codebase and explore new research directions, so I am eager to resolve this issue as soon as possible and continue with the next steps of my work.
I would greatly appreciate your guidance on what might be causing this discrepancy; a timely response would be incredibly valuable for my ongoing project.
Thank you so much for your time and for sharing such an impactful codebase!
Best regards,
A struggling graduate student