Description
Hi. I'm trying to reproduce the results from NorEval.
From the paper:
> **Generation.** The LM generates a text continuation conditioned on an input prompt. We use a greedy search decoding method for the pretrained LMs and recommended HuggingFace inference hyperparameters and chat templates for the instruction-tuned LMs. This strategy is used in the sentence completion, sequence-to-sequence generation, and generative QA tasks.
I can see the greedy-decoding generation parameters for the base LMs in the LM eval harness task configs, but I couldn't find the configurations you used for the instruct models anywhere. Do you have these at hand?
Perhaps it would make sense to document these in the repo, along with all the exact configurations needed to reproduce the metrics reported in the NorEval paper.
One of the reasons I'm looking into this is that I found the performance gaps between the instruct and non-instruct models very strange. One possible explanation is that the prompts are bespoke for the non-instruct LMs, but I want to rule out that any of the instruct models were accidentally evaluated with the non-instruct parameters.
Since these parameters are set in `_ask_gec.yaml`:

```yaml
generation_kwargs:
  until:
    - "\n"
  do_sample: false
  num_beams: 1
  max_new_tokens: 256
```
could it be that some of these kwargs were not successfully overridden when the recommended kwargs for the instruct models were specified?
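For context, here is a minimal sketch of the kind of invocation I would expect for the instruct models, assuming the overrides go through lm-evaluation-harness's `--gen_kwargs` and `--apply_chat_template` flags (the model name and sampling values below are placeholders, not the actual values from the paper):

```bash
# Hypothetical invocation: model name and sampling values are placeholders.
# --gen_kwargs is expected to override the task-level generation_kwargs
# defined in the YAML config; --apply_chat_template enables the model's
# chat template, which the instruct models need.
lm_eval \
  --model hf \
  --model_args pretrained=<instruct-model> \
  --tasks ask_gec \
  --apply_chat_template \
  --gen_kwargs do_sample=true,temperature=0.6,top_p=0.9,max_new_tokens=256
```

If the `--gen_kwargs` override silently failed (or was never passed), the instruct models would have been decoded greedily with the base-LM settings above, which could explain part of the gap.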
The largest performance gaps I observed are in `ask_gec` and `truthfulqa`.
Cheers