I'm new to NLP and am currently reading papers such as SimPO that use AlpacaEval 2 for evaluation. I have two questions:
1. Is gpt-4-1106-preview the default judge model in AlpacaEval 2?
   Many recent papers (e.g., SimPO) seem to rely on GPT-4 for evaluation. Is it specifically the gpt-4-1106-preview version, or another variant?
2. If gpt-4-1106-preview is unavailable, what are the alternatives?
   For fairness and reproducibility, which models do researchers typically use instead? (A rough sketch of how I imagine the judge would be swapped is below.)
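For context, here is a minimal sketch of how I understand the judge/annotator is selected, based on my reading of the configs shipped in this repo. The `evaluate` entry point and the config names `weighted_alpaca_eval_gpt4_turbo` and `alpaca_eval_gpt4` are my assumptions, so please correct me if the actual API or defaults differ:

```python
# Sketch only, not verified against the current alpaca_eval release.
from alpaca_eval import evaluate

# My understanding: the AlpacaEval 2.0 default annotator config is
# "weighted_alpaca_eval_gpt4_turbo", which I believe is backed by gpt-4-1106-preview.
evaluate(
    model_outputs="example/outputs.json",
    annotators_config="weighted_alpaca_eval_gpt4_turbo",
)

# Hypothetical fallback if gpt-4-1106-preview access is unavailable:
# point annotators_config at a different judge config from evaluators_configs/.
evaluate(
    model_outputs="example/outputs.json",
    annotators_config="alpaca_eval_gpt4",  # assumption: another shipped config name
)
```

If swapping the annotator config like this is the intended way to change the judge, my main concern is whether results from a different judge remain comparable to the numbers reported on the official leaderboard.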
Would appreciate any insights or references to papers addressing this! Thanks!