
Reproducible configs for evaluation of instruct models #3

@marksverdhei

Description


Hi. I'm trying to reproduce the results from NorEval.
From the paper:

Generation. The LM generates a text continuation conditioned on an input prompt. We use a greedy search decoding method for the pretrained LMs and recommended HuggingFace inference hyperparameters and chat templates for the instruction-tuned LMs. This strategy is used in the sentence completion, sequence-to-sequence generation, and generative QA tasks.
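
For reference, I assume "recommended HuggingFace inference hyperparameters" means whatever each checkpoint ships in its generation_config.json; this is roughly how I've been checking that (the model id below is only an example, not necessarily one you used):

from transformers import GenerationConfig

# Inspect the generation settings shipped with a checkpoint, if the repo
# provides a generation_config.json (model id is only an example).
gen_config = GenerationConfig.from_pretrained("norallm/normistral-7b-warm-instruct")
print(gen_config)  # temperature, top_p, repetition_penalty, etc., when the authors set them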

I can see in the LM eval harness the parameters used for greedy-decoding generation with the base LMs, but I couldn't find the configurations you used for the instruct models anywhere. Do you have these at hand?
Perhaps it would make sense to document them in the repo, i.e. all the exact configurations needed to reproduce the metrics reported in the NorEval paper.
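
For context, this is roughly what I would expect such a documented run to look like; it's my own sketch with placeholder values, not your actual settings:

import lm_eval

# Rough sketch of an instruct-model run with a chat template and explicit
# generation kwargs; the model id, task name, and sampling values are
# placeholders, not the settings used for the paper.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=norallm/normistral-7b-warm-instruct",
    tasks=["ask_gec"],
    apply_chat_template=True,
    gen_kwargs="do_sample=True,temperature=0.6,top_p=0.9",
)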

One of the reasons I'm looking into this is that I found the performance gaps between the instruct and non-instruct models very strange.
One possible explanation is that the prompts are bespoke for non-instruct LMs, but I want to rule out that any of the instruct models were accidentally evaluated with the non-instruct parameters.
Since these parameters are set in _ask_gec_yaml:

generation_kwargs:
  until:
    - "\n"
  do_sample: false
  num_beams: 1
  max_new_tokens: 256

could it be that any of these kwargs were not successfully overridden when specifying recommended kwargs for instruct models?
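
As a sanity check on my side, I've been reasoning about it roughly like this (the file path and the override values here are assumptions on my part, just for illustration):

import yaml

# Sketch of how per-model overrides would merge on top of the task YAML.
# The file path and the override values are assumptions for illustration.
with open("_ask_gec.yaml") as f:
    task_cfg = yaml.safe_load(f)

task_kwargs = dict(task_cfg.get("generation_kwargs", {}))
task_kwargs.update({"do_sample": True, "temperature": 0.6, "top_p": 0.9})  # illustrative overrides
print(task_kwargs)  # 'until': ["\n"] survives unless it is explicitly overridden

In particular, if until: ["\n"] were still in effect for the instruct models, their often multi-line chat-style outputs would be cut off after the first line, which could explain part of the gap.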

[Image: attached results table]

The largest performance gaps show up in ask_gec and truthfulqa.

Cheers
