
Commit 2caff2b

Fix descriptions in the Moral Stories and Histoires Morales tasks. (#3374)
* Update README.md
* Clarify evaluation tasks for moral reasoning datasets: updated descriptions for 'histoires_morales' and 'moral_stories' to clarify their purpose in evaluating moral judgment and ethical reasoning.
1 parent: 54e606f

File tree

1 file changed (+2 lines, -2 lines)

lm_eval/tasks/README.md

Lines changed: 2 additions & 2 deletions
@@ -81,7 +81,7 @@ provided to the individual README.md files for each subfolder.
  | [hellaswag](hellaswag/README.md) | Tasks to predict the ending of stories or scenarios, testing comprehension and creativity. | English |
  | [hendrycks_ethics](hendrycks_ethics/README.md) | Tasks designed to evaluate the ethical reasoning capabilities of models. | English |
  | [hendrycks_math](hendrycks_math/README.md) | Mathematical problem-solving tasks to test numerical reasoning and problem-solving. | English |
- | [histoires_morales](histoires_morales/README.md) | A dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | French (Some MT) |
+ | [histoires_morales](histoires_morales/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | French (Some MT) |
  | [hrm8k](hrm8k/README.md) | A challenging bilingual math reasoning benchmark for Korean and English. | Korean (Some MT), English (Some MT) |
  | [humaneval](humaneval/README.md) | Code generation task that measure functional correctness for synthesizing programs from docstrings. | Python |
  | [humaneval_infilling](humaneval_infilling/README.md) | Code generation task that measure fill-in-the-middle capability for synthesizing programs from docstrings. | Python |
@@ -131,7 +131,7 @@ provided to the individual README.md files for each subfolder.
  | [mmlu_prox](mmlu_prox/README.md) | A multilingual benchmark that extends MMLU-Pro to multiple typologically diverse languages with human validation. | English, Japanese, Chinese, Korean, French, German, Spanish, Portuguese, Zulu, Swahili, Wolof, Yoruba, Thai, Arabic, Hindi, Bengali, Serbian, Hungarian, Vietnamese, Czech, Marathi, Afrikaans, Nepali, Telugu, Urdu, Russian, Indonesian, Italian, Ukrainian |
  | [mmlusr](mmlusr/README.md) | Variation of MMLU designed to be more rigorous. | English |
  | model_written_evals | Evaluation tasks auto-generated for evaluating a collection of AI Safety concerns. | |
- | [moral_stories](moral_stories/README.md) | A crowd-sourced dataset of structured narratives that describe normative and norm-divergent actions taken by individuals to accomplish certain intentions in concrete situations. | English |
+ | [moral_stories](moral_stories/README.md) | Evaluation tasks designed to assess moral judgment and ethical reasoning in models within narrative contexts, complementing the single-sentence ETHICS tasks. | English |
  | [mts_dialog](mts_dialog/README.md) | Open-ended healthcare QA from the MTS-Dialog dataset. | English |
  | [multiblimp](multiblimp/README.md) | MultiBLiMP is a (synthetic) multilingual benchmark testing models on linguistic minimal pairs to judge grammatical acceptability | Multiple (101 languages) - Synthetic |
  | [mutual](mutual/README.md) | A retrieval-based dataset for multi-turn dialogue reasoning. | English |
