fix lm eval test reproducibility issues #1260
base: main
Conversation
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
We should look into the error you're seeing for GPTQ.
We were seeing 0.233 for llava, the entire time we've had this test running?
@brian-dellabetta The down_proj is the most likely to fail Hessian inversion, since it is the weight with the largest input size. This is likely a normal Hessian invertibility issue, which can be fixed by shuffling the dataset or using an image dataset.
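For context, a minimal sketch of what shuffling the calibration set might look like in an llm-compressor oneshot script; the dataset name and sample count below are illustrative placeholders, not the actual test setup:

```python
from datasets import load_dataset

# Hypothetical calibration setup: shuffle before selecting samples so the
# Hessian for wide layers like mlp.down_proj is estimated from a more varied
# slice of data. Dataset name and sample count are placeholders.
NUM_CALIBRATION_SAMPLES = 512

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```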
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Hitting several issues around reproducibility and slowness of lm-eval (takes about a minute for each of the 30 samples to run). Will continue this next week after resolving more urgent tasks.
SUMMARY: multi-modal lm-eval tests are failing due to a non-reproducibility issue that still needs to be resolved. In the meantime, moving those tests to a skipped folder until resolution. Resolution can be tracked in #1260.

TEST PLAN: no new source code

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
SUMMARY:
lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce consistently across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the fp8 config extensively across different versions of compressed-tensors, and it always returned the same result.
I also removed the random seed from the configs. After running each of the 3 configs several times, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway.
The Qwen2.5 7B model was erroring out during GPTQ with the default `dampening_frac=0.01`, during Hessian calculation of the `model.layers[_].mlp.down_proj` matrix, which is a `Linear(in_features=18944, out_features=3584, bias=False)` layer. This happened regardless of shuffling, but it works consistently with `dampening_frac=0.1`. The llava model runs fine with `dampening_frac=0.01`.
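For reference, a hedged sketch of how the higher dampening fraction is passed through a GPTQ recipe in llm-compressor; the model id, calibration dataset, and scheme below are illustrative, not necessarily the exact test config:

```python
# Import path may differ slightly by llm-compressor version
# (e.g. llmcompressor.transformers.oneshot in older releases).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Illustrative recipe only -- the real settings live in the lm-eval test
# configs. Raising dampening_frac from the default 0.01 to 0.1 adds more
# diagonal damping to the Hessian before inversion, which avoided the
# failure on mlp.down_proj described above.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1,
)

oneshot(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model id
    dataset="open_platypus",           # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```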
Tests take a long time to run: even with just 30 eval samples, it can take 30 minutes to run the model in `vl_w4a16_actorder_weight.yaml` (~50 minutes total with compression), compared to the couple of minutes it takes to run the dense models through lm-eval directly. This will add roughly 2 hours to the weekly testing runtime. Is that expected and acceptable?
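For scale, this is roughly the shape of the evaluation being timed, sketched with lm-eval's Python API; the task list and model id are placeholders, and the 30-sample budget maps to the `limit` argument:

```python
import lm_eval

# Illustrative only: evaluate a (compressed) HF checkpoint on a small sample
# budget. Task and model names are placeholders for the actual test setup.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct,dtype=bfloat16",
    tasks=["gsm8k"],
    limit=30,       # matches the 30 eval samples mentioned above
    batch_size=8,
)
print(results["results"])
```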
TEST PLAN:
No new source code, just fixing tests.