47 changes: 47 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/README.md
@@ -0,0 +1,47 @@
# SciKnowEval_mcqa

This task was submitted to the [NeurIPS 2025 E2LM](https://e2lmc.github.io/) competition, where it reached 3rd place on the general leaderboard.
It is intended for evaluating Small Language Models (SLMs) in their early training stages. More details are provided in the competition [proposal paper](https://arxiv.org/pdf/2506.07731).

### Benchmark details

This task uses a subset of the [SciKnowEval](https://huggingface.co/datasets/hicai-zju/SciKnowEval) dataset. Specifically, it filters out non-MCQA samples and keeps questions from levels L1, L2, and L3, which assess knowledge memory, comprehension, and reasoning, respectively, as described in the original [paper](https://arxiv.org/pdf/2406.09098v2).

The full SciKnowEval dataset is a comprehensive benchmark for evaluating the scientific knowledge and reasoning capabilities of Large Language Models (LLMs). It covers four scientific domains: Physics, Chemistry, Biology, and Materials.

The resulting SciKnowEval_mcqa dataset is available at https://huggingface.co/datasets/ShAIkespear/SciKnowEval_mcqa.
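
For reference, the subset could be reproduced roughly along the following lines. This is a minimal sketch only: the field names (`type`, `domain`, `level`), their label values, and the split name are assumptions about the source schema, not verified against the released files.

```python
# Hypothetical reconstruction of the MCQA subset from the full SciKnowEval
# dataset. Field names and label values below are assumptions, not the
# dataset's documented schema.
from datasets import load_dataset

full = load_dataset("hicai-zju/SciKnowEval", split="test")  # assumed split name

# Keep multiple-choice questions from levels L1-L3 (knowledge memory,
# comprehension, and reasoning).
mcqa = full.filter(
    lambda ex: ex["type"] == "mcq" and ex["level"] in {"L1", "L2", "L3"}
)

# One subset per scientific domain, matching the four task configs below.
by_domain = {
    domain: mcqa.filter(lambda ex, d=domain: ex["domain"] == d)
    for domain in ["Biology", "Chemistry", "Material", "Physics"]
}
```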

### Citation

```bibtex
@misc{sci-know-2025-mcqa,
title = "SciKnowEval_mcqa: A Benchmark for Small Language Model Evaluation in their Early Training Stages",
author = "Anthony Kalaydjian and Eric Saikali",
year = "2025",
}
```

### Groups and Tasks

#### Groups

* `sciknoweval_mcqa`: aggregates `sciknoweval_mcqa_var5shots_Biology`, `sciknoweval_mcqa_var5shots_Chemistry`, `sciknoweval_mcqa_var5shots_Material`, and `sciknoweval_mcqa_var5shots_Physics`, reporting accuracy weighted by subset size (see the usage sketch after the task list below).

#### Tasks
* `sciknoweval_mcqa_var5shots_Biology`: data across all remaining splits corresponding to Biology MCQs.
* `sciknoweval_mcqa_var5shots_Chemistry`: data across all remaining splits corresponding to Chemistry MCQs.
* `sciknoweval_mcqa_var5shots_Material`: data across all remaining splits corresponding to Material MCQs.
* `sciknoweval_mcqa_var5shots_Physics`: data across all remaining splits corresponding to Physics MCQs.
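
The group can be run like any other harness task. As an illustration, here is a minimal sketch using the Python API; the model checkpoint is only a placeholder:

```python
# Evaluate the whole group (all four domain tasks) with lm-evaluation-harness.
# "EleutherAI/pythia-160m" is a placeholder checkpoint; any HF causal LM works.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["sciknoweval_mcqa"],
    batch_size=8,
)
print(results["results"])  # per-task and aggregated accuracy
```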

### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [x] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
9 changes: 9 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/_sciknoweval_mcqa.yaml
@@ -0,0 +1,9 @@
group: sciknoweval_mcqa
group_alias: sciknoweval_mcqa (var5shots)
task:
- sciknoweval_mcqa_task
aggregate_metric_list:
- metric: acc
weight_by_size: True
metadata:
version: 2
14 changes: 14 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/_var5shots_template_yaml
@@ -0,0 +1,14 @@
dataset_path: ShAIkespear/SciKnowEval_mcqa
output_type: multiple_choice
test_split: test
fewshot_split: dev
num_fewshot: 5
fewshot_config:
sampler: first_n
doc_to_text: "Question: {{question.strip()}}\nAnswer:"
doc_to_choice: "{{choices}}"
doc_to_target: "{{answer}}"
metadata:
version: 1.0
dataset_kwargs:
trust_remote_code: true
7 changes: 7 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/sciknoweval_Biology.yaml
@@ -0,0 +1,7 @@
"dataset_name": "Biology"
"description": "The following are multiple choice questions (with answers) about Biology.\n\
\n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Biology"
"task_alias": "Biology"
7 changes: 7 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/sciknoweval_Chemistry.yaml
@@ -0,0 +1,7 @@
"dataset_name": "Chemistry"
"description": "The following are multiple choice questions (with answers) about Chemistry.\n\
\n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Chemistry"
"task_alias": "Chemistry"
7 changes: 7 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/sciknoweval_Material.yaml
@@ -0,0 +1,7 @@
"dataset_name": "Material"
"description": "The following are multiple choice questions (with answers) about Material.\n\
\n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Material"
"task_alias": "Material"
7 changes: 7 additions & 0 deletions lm_eval/tasks/sciknoweval_mcqa/sciknoweval_Physics.yaml
@@ -0,0 +1,7 @@
"dataset_name": "Physics"
"description": "The following are multiple choice questions (with answers) about Physics.\n\
\n"
"include": "_var5shots_template_yaml"
"tag": "sciknoweval_mcqa_task"
"task": "sciknoweval_mcqa_var5shots_Physics"
"task_alias": "Physics"