EleutherAI · rimashahbazyan · Oct 22, 2024 · Oct 29, 2024
@@ -107,9 +107,6 @@ def __post_init__(self) -> None:
                 self.generation_kwargs["temperature"] = float(
                     self.generation_kwargs["temperature"]
                 )
-
-            if "until" not in self.generation_kwargs:
-                self.generation_kwargs["until"] = [self.fewshot_delimiter]
         else:
             if self.output_type == "generate_until":
                 # ensure that we greedily generate in absence of explicit arguments otherwise

@@ -97,6 +97,7 @@
 | [race](race/README.md) | Reading comprehension assessment tasks based on English exams in China. | English |
 | realtoxicityprompts | Tasks to evaluate language models for generating text with potential toxicity. | |
 | [sciq](sciq/README.md) | Science Question Answering tasks to assess understanding of scientific concepts. | English |
+| [score](score/README.md) | Systematic consistency and robustness evaluation for LLMs on 3 datasets (MMLU-Pro, AGIEval, and MATH). | English |
 | [scrolls](scrolls/README.md) | Tasks that involve long-form reading comprehension across various domains. | English |
 | [siqa](siqa/README.md) | Social Interaction Question Answering to evaluate common sense and social reasoning.  | English |
 | [spanish_bench](spanish_bench/README.md) | Collection of tasks in Spanish encompassing various evaluation areas. | Spanish |

@@ -0,0 +1,89 @@
+```
+Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+```
+# SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models
+
+
+## Citation
+```bib
+[Citation placeholder]
+```
+
+## Groups
+
+- `score_robustness_mmlu_pro`: two 0-shot robustness tasks on the MMLU-Pro dataset [[1](#mmlu_pro)]
+
+- `score_robustness_agieval`: two 0-shot robustness tasks on the multiple-choice subsets of the AGIEval datasets [[2](#agi_eval)]: `'agieval-sat-math'`, `'agieval-lsat-lr'`, `'agieval-lsat-rc'`, `'agieval-logiqa-en'`, `'agieval-aqua-rat'`, `'agieval-sat-en'`, `'agieval-lsat-ar'`
+
+- `score_robustness_math`: one 0-shot robustness task on Hendrycks' MATH dataset [[3](#math)]
+
+## Tasks
+
+Both `score_robustness_mmlu_pro` and `score_robustness_agieval` contain the following 2 tasks:
+
+* Option order robustness:
+`score_option_order_robustness_mmlu_pro`,
+`score_option_order_robustness_agieval`
+
+* Prompt robustness:
+`score_prompt_robustness_mmlu_pro`,
+`score_prompt_robustness_agieval`
+
+`score_robustness_math` contains only one task:
+* Prompt robustness:
+`score_prompt_robustness_math`
+
+
+### Option order robustness
+
+Measures the model's robustness to the placement of the correct answer in the options list by swapping the correct answer with all the other possible options.
+
+### Prompt robustness
+
+Measures the model's robustness to 10 different prompts. The list of prompts can be found in the `./prompt_templates.json` file under the key `prompt_robustness`.
+
+
+## Metrics
+
+All robustness tasks calculate 2 metrics: *Accuracy* and *Consistency Rate (CR)* [[4](#cr)].
+
+$CR = \frac{1}{|Q|} \sum_{Q_k \in Q} \sum_{y_i \in Y_k} \sum_{\substack{y_j \in Y_k \\ j \neq i}}\frac{\text{sim}(y_i, y_j)}{\binom{|Y_k|}{2}}$
+
+## Notes
+
+- All tasks are designed for **Instruct** models, for which we recommend passing the `--apply_chat_template` flag.
+
+
+## References
+<a name=mmlu_pro></a>[1] Wang et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv:2406.01574 (2024).
+
+<a name=agi_eval></a>[2] Zhong et al. "AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models." arXiv:2304.06364 (2023).
+
+<a name=math></a>[3] Hendrycks et al. "Measuring Mathematical Problem Solving With the MATH Dataset." arXiv:2103.03874 (2021).
+
+<a name=cr></a>[4] Zhao et al. "Improving the Robustness of Large Language Models via Consistency Alignment." arXiv:2403.14221 (2024).
+
+## Checklist
+
+For adding novel benchmarks/datasets to the library:
+* [-] Is the task an existing benchmark in the literature?
+  * [-] Have you referenced the original paper that introduced the task? - Will be referenced as soon as the paper is published
+  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+
+If other tasks on this dataset are already supported:
+* [x] Is the "Main" variant of this task clearly denoted?
+* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+* [x] Have you noted which, if any, published evaluation setups are matched by this variant?
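
Review note on the README's Consistency Rate metric: the formula averages, over all questions, the pairwise similarity of the answers a model gives to the variants of one question. The sketch below is one possible reading of it, not the code from this PR's `utils_*` modules. It assumes that `sim` is exact string match and that each unordered pair of answers is counted once, so the score stays in [0, 1]. The function name `consistency_rate` and its input layout are hypothetical.

```python
from itertools import combinations
from math import comb


def consistency_rate(answers_per_question):
    """Mean pairwise agreement of answers across the variants of each question.

    answers_per_question: list of lists. Each inner list holds the answers the
    model gave to the different variants (prompts or option orders) of one
    question. This layout is an assumption made for illustration.
    """
    total = 0.0
    counted = 0
    for answers in answers_per_question:
        n_pairs = comb(len(answers), 2)
        if n_pairs == 0:
            continue  # a single answer has nothing to be compared with
        # Assumed similarity: 1 if two answers match exactly, else 0.
        agreeing = sum(1 for a, b in combinations(answers, 2) if a == b)
        total += agreeing / n_pairs
        counted += 1
    return total / counted if counted else 0.0
```

With three variants per question, a model that answers `A, A, A` on one question and `A, B, A` on another agrees on 3 of 3 pairs and 1 of 3 pairs respectively, so the sketch returns the mean of 1 and 1/3.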
@@ -0,0 +1,46 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+task: option_order_robustness_agieval_aqua_rat
+dataset_path: hails/agieval-aqua-rat
+dataset_name: default
+output_type: generate_until
+test_split: test
+process_docs: !function utils_agieval.option_order_robustness_process_docs
+doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
+doc_to_target: answer
+generation_kwargs:
+  max_gen_toks: 1024
+  do_sample: False
+process_results: !function utils_agieval.option_order_robustness_process_results
+metric_list:
+  - metric: per_option_accuracy_A
+    aggregation:  !function utils_agieval.per_option_accuracy_a
+    higher_is_better: true
+  - metric: per_option_accuracy_B
+    aggregation:  !function utils_agieval.per_option_accuracy_b
+    higher_is_better: true
+  - metric: per_option_accuracy_C
+    aggregation:  !function utils_agieval.per_option_accuracy_c
+    higher_is_better: true
+  - metric: per_option_accuracy_D
+    aggregation:  !function utils_agieval.per_option_accuracy_d
+    higher_is_better: true
+  - metric: options_consistency_rate
+    aggregation:  !function utils_agieval.options_consistency_rate
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
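
Review note on the option-order config above: the README describes this task as moving the correct answer to every position in the options list, and the config reports one accuracy per gold position (`per_option_accuracy_A` to `per_option_accuracy_D`). The following is a minimal sketch of that idea under stated assumptions. It is not the `utils_agieval` implementation; `swap_correct_option`, `per_option_accuracy`, and the `(gold, predicted)` record layout are hypothetical names chosen for illustration, and the variant that leaves the answer in its original position is included by assumption.

```python
import string


def swap_correct_option(options, answer_idx):
    """Return one variant per position, with the correct option swapped into it.

    Each variant is a pair (reordered_options, gold_label), where gold_label is
    the letter of the position that now holds the correct option.
    """
    variants = []
    for target in range(len(options)):
        reordered = list(options)
        # Swap the correct option with whatever currently sits at `target`.
        reordered[answer_idx], reordered[target] = (
            reordered[target],
            reordered[answer_idx],
        )
        variants.append((reordered, string.ascii_uppercase[target]))
    return variants


def per_option_accuracy(results, label):
    """Accuracy restricted to items whose gold answer sits at position `label`.

    results: list of (gold_label, predicted_label) pairs (assumed layout).
    """
    relevant = [pred == gold for gold, pred in results if gold == label]
    return sum(relevant) / len(relevant) if relevant else 0.0
```

A large gap between the per-position accuracies would suggest that the model favours certain answer letters regardless of content, which is the effect this task is meant to expose.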
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_logiqa_en
+dataset_path: hails/agieval-logiqa-en
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_lsat_ar
+dataset_path: hails/agieval-lsat-ar
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_lsat_lr
+dataset_path: hails/agieval-lsat-lr
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_lsat_rc
+dataset_path: hails/agieval-lsat-rc
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_sat_en
+dataset_path: hails/agieval-sat-en
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: option_order_robustness_agieval_aqua_rat.yaml
+task: option_order_robustness_agieval_sat_math
+dataset_path: hails/agieval-sat-math
@@ -0,0 +1,64 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+task: prompt_robustness_agieval_aqua_rat
+dataset_path: hails/agieval-aqua-rat
+dataset_name: default
+output_type: generate_until
+test_split: test
+process_docs: !function utils_agieval.prompt_robustness_process_docs
+doc_to_text: !function utils_agieval.agi_eval_robustness_doc_to_text
+doc_to_target: answer
+generation_kwargs:
+  max_gen_toks: 1024
+  do_sample: False
+process_results: !function utils_agieval.prompt_robustness_process_results
+metric_list:
+  - metric: 0_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_0
+    higher_is_better: true
+  - metric: 1_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_1
+    higher_is_better: true
+  - metric: 2_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_2
+    higher_is_better: true
+  - metric: 3_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_3
+    higher_is_better: true
+  - metric: 4_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_4
+    higher_is_better: true
+  - metric: 5_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_5
+    higher_is_better: true
+  - metric: 6_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_6
+    higher_is_better: true
+  - metric: 7_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_7
+    higher_is_better: true
+  - metric: 8_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_8
+    higher_is_better: true
+  - metric: 9_accuracy
+    aggregation:  !function utils_agieval.per_prompt_accuracy_9
+    higher_is_better: true
+  - metric: consistency_rate
+    aggregation:  !function utils_agieval.agi_eval_prompt_consistency_rate
+    higher_is_better: true
+metadata:
+  version: 1.0
+dataset_kwargs:
+  trust_remote_code: true
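
Review note on the prompt-robustness config above: each question is posed with several prompt templates and the config reports one accuracy per template (`0_accuracy` to `9_accuracy`). A rough sketch of that flow follows. The two templates here are invented placeholders, not the contents of `prompt_templates.json`, and `render_prompts`, `per_prompt_accuracy`, and the `(prompt_id, is_correct)` record layout are hypothetical.

```python
from collections import defaultdict

# Placeholder templates for illustration only; the real ones live in
# prompt_templates.json under the key `prompt_robustness`.
TEMPLATES = [
    "Question: {question}\nOptions:\n{options}\nAnswer:",
    "Choose the correct option.\n{question}\n{options}\nYour answer:",
]


def render_prompts(question, options):
    """Render the same question once per template."""
    lines = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)
    )
    return [t.format(question=question, options=lines) for t in TEMPLATES]


def per_prompt_accuracy(records):
    """Group correctness flags by prompt id and average each group.

    records: iterable of (prompt_id, is_correct) pairs (assumed layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prompt_id, is_correct in records:
        hits[prompt_id] += int(is_correct)
        totals[prompt_id] += 1
    return {f"{pid}_accuracy": hits[pid] / totals[pid] for pid in sorted(totals)}
```

Grouping by prompt id in a single function is only one way to organise this; the config in this PR instead references a separate aggregation function per prompt.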
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: prompt_robustness_agieval_aqua_rat.yaml
+task: prompt_robustness_agieval_logiqa_en
+dataset_path: hails/agieval-logiqa-en
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: prompt_robustness_agieval_aqua_rat.yaml
+task: prompt_robustness_agieval_lsat_rc
+dataset_path: hails/agieval-lsat-rc
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: prompt_robustness_agieval_aqua_rat.yaml
+task: prompt_robustness_agieval_lsat_ar
+dataset_path: hails/agieval-lsat-ar
@@ -0,0 +1,17 @@
+# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+#    http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+include: prompt_robustness_agieval_aqua_rat.yaml
+task: prompt_robustness_agieval_lsat_lr
+dataset_path: hails/agieval-lsat-lr