
Commit 2caff2b

Fix descriptions in the Moral Stories and Histoires Morales tasks. (#3374)
* Update README.md
* Clarify evaluation tasks for moral reasoning datasets: updated descriptions for 'histoires_morales' and 'moral_stories' to clarify their purpose in evaluating moral judgment and ethical reasoning.
1 parent: 54e606f

File tree

1 file changed (+2 lines, -2 lines)

lm_eval/tasks/README.md

Lines changed: 2 additions & 2 deletions
@@ -81,7 +81,7 @@ provided to the individual README.md files for each subfolder.
  | [hellaswag](hellaswag/README.md) | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
  | [hendrycks_ethics](hendrycks_ethics/README.md) | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
  | [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
- | [histoires_morales](histoires_morales/README.md) | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
+ | [histoires_morales](histoires_morales/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
  | [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
  | [humaneval](humaneval/README.md) | Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python |
  | [humaneval_infilling](humaneval_infilling/README.md) | Code generation task that measure fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
@@ -131,7 +131,7 @@ provided to the individual README.md files for each subfolder.
  | [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
  | [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
  | model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
- | [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
+ | [moral_stories](moral_stories/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
  | [mts_dialog](mts_dialog/README.md) | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
  | [multiblimp](multiblimp/README.md) | MultiBLiMP is a (synthetic) multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability | Multiple (101 languages) - Synthetic |
  | [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
