
Differences in Model Performance When Reproducing Experiment #32

@fannie1208

Hi, thank you for your nice work!

I'm reproducing the results in Table 2, using the Mistral-7B model on MMLU and TydiQA with 5% of the data selected.
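(To be concrete about what I mean by "5% of the data selected": I keep the training examples with the top 5% of influence scores. A rough sketch, with hypothetical file and variable names rather than anything from the repo:)

```python
# Rough sketch (hypothetical names, not from the repo) of the selection step:
# keep the training examples whose influence scores fall in the top 5%.
import torch

scores = torch.load("influence_scores.pt")   # hypothetical file: one score per training example
k = int(0.05 * scores.numel())               # 5% selection budget
selected = torch.topk(scores, k).indices     # indices of the selected examples
```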


I followed the scripts in your repo for the warmup, data selection, and training steps, and used the evaluation code in your repo for evaluation. I did not change any settings in your scripts, except that I ran with only a single random seed (3).
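(For reference, this is how I fixed the seed in my runs. A minimal sketch, assuming the training goes through `transformers`; `set_seed` seeds Python's `random`, NumPy, and PyTorch together:)

```python
# Minimal sketch of the only deviation from the repo's defaults in my runs:
# fixing a single random seed (3) across python, numpy, and torch.
from transformers import set_seed

set_seed(3)
```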

Despite following these settings, my model's performance is worse than the results reported in Table 2.
For MMLU, Random gives 58.3 (60.0 in your paper) and LESS gives 60.8 (61.8 in your paper).
For TydiQA, Random reaches an F1 of 44.6 and LESS 55.1.

My environment is: torch 2.4.0, transformers 4.45.2, peft 0.13.1, datasets 3.0.1.
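(In case it helps, here is the quick check I used to confirm these versions, via the standard-library `importlib.metadata`:)

```python
# Print the installed versions of the libraries relevant to this issue.
from importlib.metadata import version

for pkg in ["torch", "transformers", "peft", "datasets"]:
    print(pkg, version(pkg))
```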

Are these differences reasonable? Could you please confirm whether the settings in your scripts are fully aligned with those used in your paper?

Thanks.
