fix lm eval test reproducibility issues #1260

Open · wants to merge 4 commits into main

Conversation

@brian-dellabetta (Collaborator) commented Mar 17, 2025

SUMMARY:
lm-eval multimodal tests were failing to reproduce across different versions of compressed-tensors. After upgrading the models from 2B to 7B, the tests reproduce consistently across compressed-tensors 0.9.1, 0.9.2, and nightly. I ran the fp8 config extensively across different versions of CT, and it always returned the same result.

I also removed the random seed from the configs. After several runs of each of the 3 configs, I did not see any change in results. This may cause errors during CI/CD testing, but I'd like to see if it does; I feel that is a better e2e test anyway.
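
For context, a hypothetical sketch of what one of these lm-eval test configs might look like after the change. The field names below are illustrative guesses rather than the repo's actual schema; the only point is that the explicit seed entry is gone:

```yaml
# Hypothetical shape of a tests/lmeval config after this change --
# field names are illustrative, not copied from the repo.
cadence: weekly
model: Qwen/Qwen2.5-VL-7B-Instruct   # upgraded from the 2B variant
scheme: FP8_DYNAMIC
lmeval:
  task: mmmu_val_literature          # placeholder task name
  num_fewshot: 0
  limit: 30                          # 30 eval samples, per the summary
# seed: 42  <- removed; repeated runs returned identical results without it
```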

The Qwen2.5 7B model was erroring out during GPTQ with the default dampening_frac=0.01, during Hessian calculation for the model.layers[_].mlp.down_proj matrix, which is a Linear(in_features=18944, out_features=3584, bias=False) layer. This happened regardless of shuffling, but it works consistently with dampening_frac=0.1. The llava model runs fine with dampening_frac=0.01.
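
A minimal sketch of the workaround, assuming the usual llm-compressor oneshot flow; the model id, dataset, and calibration settings below are placeholders rather than the values used in the actual test configs:

```python
# Sketch only: raising dampening_frac for the Qwen2.5 7B GPTQ run.
# Model/dataset names are assumptions for illustration, not the test's values.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    # Default is 0.01; 0.1 adds more diagonal damping so the Hessian for the
    # wide down_proj (in_features=18944) stays invertible during calibration.
    dampening_frac=0.1,
)

oneshot(
    model="Qwen/Qwen2.5-VL-7B-Instruct",  # hypothetical model id
    dataset="flickr30k",                  # hypothetical calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```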

Tests take a long time to run: even with just 30 eval samples, it can take 30 minutes to run the model in vl_w4a16_actorder_weight.yaml (~50 minutes total with compression), compared to the couple of minutes it takes to run the dense models through lm-eval directly. This will add roughly 2 hours to the weekly testing runtime. Is that expected and acceptable?

TEST PLAN:
No new source code, just fixing tests.

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@brian-dellabetta added the `ready` label (When a PR is ready for review) on Mar 17, 2025
@dsikka (Collaborator) left a comment

We should look into the error you're seeing for GPTQ.
Were we seeing 0.233 for llava the entire time we've had this test running?

@kylesayrs (Collaborator) commented

@brian-dellabetta The down_proj is the most likely to fail Hessian inversion, since it is the weight with the largest input size.

This is likely a normal Hessian invertibility issue, which can be fixed by shuffling the dataset or using an image dataset.
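
A rough illustration of the mechanism (not llm-compressor's actual implementation): GPTQ inverts a Hessian built from calibration activations, and dampening_frac adds a fraction of the mean diagonal before inversion so a near-singular Hessian, like the one for the wide down_proj, stays positive definite:

```python
# Sketch of why dampening helps, under the assumption that GPTQ's Hessian
# is H = X^T X over calibration activations X for a given linear layer.
import numpy as np

def damped_inverse(H: np.ndarray, dampening_frac: float = 0.01) -> np.ndarray:
    # Add a fraction of the mean diagonal to every diagonal entry.
    damp = dampening_frac * np.mean(np.diag(H))
    H_damped = H + damp * np.eye(H.shape[0])
    # Cholesky raises if H_damped is still not positive definite, which is
    # roughly the failure mode hit at dampening_frac=0.01 for down_proj.
    L = np.linalg.cholesky(H_damped)
    return np.linalg.inv(L).T @ np.linalg.inv(L)

# Example: a rank-deficient Hessian that only inverts cleanly once damped.
X = np.random.randn(8, 64)   # fewer samples than features -> singular X^T X
H = X.T @ X
H_inv = damped_inverse(H, dampening_frac=0.1)
```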

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
@brian-dellabetta removed the `ready` label (When a PR is ready for review) on Mar 19, 2025
@brian-dellabetta (Collaborator, Author) commented

Hitting several issues around reproducibility and slowness of lm-eval (it takes about a minute for each of the 30 samples to run). Will continue this next week after resolving more urgent tasks.

dsikka pushed a commit that referenced this pull request Mar 21, 2025
SUMMARY:
multi-modal lm-eval tests are failing due to a non-reproducibility issue
that still needs to be resolved. In the meantime, moving those tests to
a skipped folder until resolution.

Resolution can be tracked in #1260.

TEST PLAN:
no new source code

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>