SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
📖 Paper • ⌚️ Overview • 📊 Datasets • ⚡ Getting Started • 📝 Cite
- [2025.05]: SciCUEval is released as a comprehensive dataset for evaluating LLMs' context understanding capability in diverse scientific domains.
## ⌚️ Overview

SciCUEval is a benchmark designed to evaluate the context understanding capabilities of Large Language Models (LLMs) in scientific domains. It addresses the lack of specialized benchmarks in this area by providing:
- **10 Diverse Datasets**: Spanning biology, chemistry, physics, biomedicine, and materials science.
- **Multiple Data Modalities**: Structured tables, knowledge graphs, and unstructured text.
- **Four Core Competencies**: Relevant information identification, information-absence detection, multi-source information integration, and context-aware inference.
- **Comprehensive Evaluation**: Assessing state-of-the-art LLMs across various scientific tasks.
## 📊 Datasets

SciCUEval comprises ten sub-datasets covering different scientific domains and data modalities:
| Sub-dataset | Domain | Source | Modality | # Info. Ident. | # Abs. Detec. | # Info. Integ. | # Con. Infer. | # Total |
|---|---|---|---|---|---|---|---|---|
| MatText | Materials | arXiv | Text | 216 | 146 | 222 | 356 | 940 |
| BioText | Biology | bioRxiv | Text | 236 | 97 | 318 | 317 | 968 |
| MatTab | Materials | Materials Project | Table | 299 | 150 | 287 | 200 | 936 |
| IaeaTab | Physics | IAEA | Table | 442 | 222 | 286 | 180 | 1130 |
| ProtTab | Biology | PubChem | Table | 496 | 249 | 327 | 180 | 1252 |
| MolTab | Chemistry | PubChem | Table | 516 | 259 | 350 | 180 | 1305 |
| GoKG | Biology | Gene Ontology | KG | 507 | 254 | 239 | 180 | 1180 |
| HipKG | Biology | HIPPIE | KG | 470 | 236 | 319 | 140 | 1165 |
| PhaKG | Biomedicine | PharmKG | KG | 512 | 256 | 281 | 168 | 1217 |
| PriKG | Biomedicine | PrimeKG | KG | 410 | 205 | 382 | 253 | 1250 |
SciCUEval evaluates LLMs on four key dimensions:
- **Relevant Information Identification**: Ability to filter out irrelevant or noisy information.
- **Information-absence Detection**: Refraining from answering when no correct information is retrieved.
- **Multi-source Information Integration**: Synthesizing data from multiple sources.
- **Context-aware Inference**: Performing logical inference on retrieved data.
## ⚡ Getting Started

- Clone the repository:

  ```bash
  git clone https://github.com/HICAI-ZJU/SciCUEval.git
  cd SciCUEval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Deploy your model as an OpenAI-compatible server (a quick connectivity check is sketched below).
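  Any OpenAI-compatible serving stack (e.g., vLLM) should work. Before running the benchmark, it can help to verify the endpoint with a minimal request; the base URL, key, and model name below are placeholder assumptions for a local deployment:

  ```python
  # Minimal connectivity check for an OpenAI-compatible endpoint.
  # base_url, api_key, and model are placeholders -- substitute your own.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",  # your server's endpoint
      api_key="EMPTY",                      # many local servers accept any key
  )

  resp = client.chat.completions.create(
      model="qwen2.5-7b-instruct",
      messages=[{"role": "user", "content": "Say hello."}],
  )
  print(resp.choices[0].message.content)
  ```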
- Configure the API settings in `run.py`: set `API_KEY` and `API_BASE` for your model.
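  The exact assignments live in `run.py`; the values below are placeholder assumptions matching the local deployment above:

  ```python
  # In run.py -- placeholder values; substitute your own deployment's settings.
  API_KEY = "EMPTY"                      # key expected by your server, if any
  API_BASE = "http://localhost:8000/v1"  # base URL of the OpenAI-compatible server
  ```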
- Run the evaluation:

  ```bash
  python run.py --rag [True/False] --model [MODEL_NAME] --dataset [DATASET] --type [COMPETENCIES]
  ```

  Example:

  ```bash
  python run.py --rag True --model qwen2.5-7b-instruct --dataset MatText --type Context-aware_Inference,Relevant_Information_Identification
  ```
- Generate the results:

  ```bash
  python eval.py
  ```

  Results are saved to `result.json`, `dataset_result.txt`, and `overall_result.txt`.
Evaluation results are saved in the `output` directory, organized by model and dataset for detailed performance analysis:
```text
/output/
├── qwen2.5-7b-instruct-True/
│   ├── MatText/
│   │   ├── Relevant_Information_Identification.json
│   │   ├── Information-absence_Detection.jsonl
│   │   └── ...
│   └── MolTab/
└── ...
```
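For a quick inventory of what a run produced before diving into per-item files, a short script can walk this tree. This is a minimal sketch assuming the layout shown above; it only lists files and does not parse their contents, since the per-item schema is defined by `eval.py`:

```python
# List every model/dataset/competency result file under output/.
# Assumes the directory layout shown above (an assumption, not a guarantee).
from pathlib import Path
from collections import defaultdict

output_dir = Path("output")
summary = defaultdict(list)

for f in sorted(output_dir.glob("*/*/*.json*")):  # matches .json and .jsonl
    model, dataset = f.parts[-3], f.parts[-2]
    summary[(model, dataset)].append(f.stem)      # competency name

for (model, dataset), competencies in sorted(summary.items()):
    print(f"{model} / {dataset}: {', '.join(competencies)}")
```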
## 📝 Cite

If you use SciCUEval in your research, please cite our paper:

```bibtex
@misc{yu2025scicueval,
      title={SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models},
      author={Jing Yu and Yuqi Tang and Kehua Feng and Mingyang Rao and Lei Liang and Zhiqiang Zhang and Mengshu Sun and Wen Zhang and Qiang Zhang and Keyan Ding and Huajun Chen},
      year={2025},
      eprint={2505.15094},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.15094},
}
```