
Reproducible configs for evaluation of instruct models #3

@marksverdhei

Description


Hi. I'm trying to reproduce the results from NorEval.
From the paper:

Generation. The LM generates a text continuation conditioned on an input prompt. We use a greedy search decoding method for the pretrained LMs and recommended HuggingFace inference hyperparameters and chat templates for the instruction-tuned LMs. This strategy is used in the sentence completion, sequence-to-sequence generation, and generative QA tasks.
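
For reference, I assume "recommended HuggingFace inference hyperparameters" means whatever each checkpoint ships in its generation_config.json; this is roughly how I've been checking that (the model id below is only an example, not necessarily one you used):

from transformers import GenerationConfig

# Inspect the generation settings shipped with a checkpoint, if the repo
# provides a generation_config.json (model id is only an example).
gen_config = GenerationConfig.from_pretrained("norallm/normistral-7b-warm-instruct")
print(gen_config)  # temperature, top_p, repetition_penalty, etc., when the authors set them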

I can see in the LM eval harness the parameters used for greedy-decoding generation with the base LMs, but I couldn't find the configurations you used for the instruct models anywhere. Do you have these at hand?
Perhaps it would make sense to document them in the repo, i.e. all the exact configurations needed to reproduce the metrics reported in the NorEval paper.
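
For context, this is roughly what I would expect such a documented run to look like; it's my own sketch with placeholder values, not your actual settings:

import lm_eval

# Rough sketch of an instruct-model run with a chat template and explicit
# generation kwargs; the model id, task name, and sampling values are
# placeholders, not the settings used for the paper.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=norallm/normistral-7b-warm-instruct",
    tasks=["ask_gec"],
    apply_chat_template=True,
    gen_kwargs="do_sample=True,temperature=0.6,top_p=0.9",
)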

One of the reasons I'm looking into this is that I found the performance gaps between the instruct and non-instruct models very strange.
One possible explanation is that the prompts are bespoke for non-instruct LMs, but I want to rule out that any of the instruct models were accidentally evaluated with the non-instruct parameters.
Since these parameters are set in _ask_gec_yaml:

generation_kwargs:
  until:
    - "\n"
  do_sample: false
  num_beams: 1
  max_new_tokens: 256

could it be that any of these kwargs were not successfully overridden when specifying recommended kwargs for instruct models?
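
As a sanity check on my side, I've been reasoning about it roughly like this (the file path and the override values here are assumptions on my part, just for illustration):

import yaml

# Sketch of how per-model overrides would merge on top of the task YAML.
# The file path and the override values are assumptions for illustration.
with open("_ask_gec.yaml") as f:
    task_cfg = yaml.safe_load(f)

task_kwargs = dict(task_cfg.get("generation_kwargs", {}))
task_kwargs.update({"do_sample": True, "temperature": 0.6, "top_p": 0.9})  # illustrative overrides
print(task_kwargs)  # 'until': ["\n"] survives unless it is explicitly overridden

In particular, if until: ["\n"] were still in effect for the instruct models, their often multi-line chat-style outputs would be cut off after the first line, which could explain part of the gap.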

[Image: attached results table]

The largest performance gaps show up in ask_gec and truthfulqa.

Cheers
