
Differences in Model Performance When Reproducing Experiment #32

@fannie1208

Hi, thank you for your nice work!

I'm reproducing the results in Table 2, using the Mistral-7B model on MMLU and TydiQA with 5% of the data selected.
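(To be concrete about what I mean by "5% of the data selected": I keep the training examples with the top 5% of influence scores. A rough sketch, with hypothetical file and variable names rather than anything from the repo:)

```python
# Rough sketch (hypothetical names, not from the repo) of the selection step:
# keep the training examples whose influence scores fall in the top 5%.
import torch

scores = torch.load("influence_scores.pt")   # hypothetical file: one score per training example
k = int(0.05 * scores.numel())               # 5% selection budget
selected = torch.topk(scores, k).indices     # indices of the selected examples
```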


I followed the scripts in your repo for the warmup, data selection, and training steps, and used the evaluation code in your repo for evaluation. I did not change any settings in your scripts, except that I ran with only a single random seed (3).
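(For reference, this is how I fixed the seed in my runs. A minimal sketch, assuming the training goes through `transformers`; `set_seed` seeds Python's `random`, NumPy, and PyTorch together:)

```python
# Minimal sketch of the only deviation from the repo's defaults in my runs:
# fixing a single random seed (3) across python, numpy, and torch.
from transformers import set_seed

set_seed(3)
```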

Despite following these settings, my model's performance is worse than the results reported in Table 2.
For MMLU, Random gives 58.3 (60.0 in your paper) and LESS gives 60.8 (61.8 in your paper).
For TydiQA, Random reaches an F1 of 44.6 and LESS 55.1.

My environment is: torch 2.4.0, transformers 4.45.2, peft 0.13.1, datasets 3.0.1.
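(In case it helps, here is the quick check I used to confirm these versions, via the standard-library `importlib.metadata`:)

```python
# Print the installed versions of the libraries relevant to this issue.
from importlib.metadata import version

for pkg in ["torch", "transformers", "peft", "datasets"]:
    print(pkg, version(pkg))
```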

Are these differences reasonable? Could you please confirm whether the settings in your scripts are fully aligned with those used in your paper?

Thanks.
