SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models
📖 Paper • ⌚️ Overview • 📊 Datasets • ⚡ Getting Started • 📝 Cite
- [2025.05]: SciCUEval is released as a comprehensive dataset for evaluating LLMs' context understanding capability in diverse scientific domains.
## ⌚️ Overview

SciCUEval is a benchmark designed to evaluate the context understanding capabilities of Large Language Models (LLMs) in scientific domains. It addresses the lack of specialized benchmarks in this area by providing:
- **10 Diverse Datasets**: Spanning biology, chemistry, physics, biomedicine, and materials science.
- **Multiple Data Modalities**: Structured tables, knowledge graphs, and unstructured text.
- **Four Core Competencies**: Relevant information identification, information-absence detection, multi-source information integration, and context-aware inference.
- **Comprehensive Evaluation**: Assessing state-of-the-art LLMs across various scientific tasks.
## 📊 Datasets

SciCUEval comprises ten sub-datasets covering different scientific domains and data modalities:
| Sub-dataset | Domain | Source | Modality | # Info. Ident. | # Abs. Detec. | # Info. Integ. | # Con. Infer. | # Total |
|---|---|---|---|---|---|---|---|---|
| MatText | Materials | arXiv | Text | 216 | 146 | 222 | 356 | 940 |
| BioText | Biology | bioRxiv | Text | 236 | 97 | 318 | 317 | 968 |
| MatTab | Materials | Materials Project | Table | 299 | 150 | 287 | 200 | 936 |
| IaeaTab | Physics | IAEA | Table | 442 | 222 | 286 | 180 | 1130 |
| ProtTab | Biology | PubChem | Table | 496 | 249 | 327 | 180 | 1252 |
| MolTab | Chemistry | PubChem | Table | 516 | 259 | 350 | 180 | 1305 |
| GoKG | Biology | Gene Ontology | KG | 507 | 254 | 239 | 180 | 1180 |
| HipKG | Biology | HIPPIE | KG | 470 | 236 | 319 | 140 | 1165 |
| PhaKG | Biomedicine | PharmKG | KG | 512 | 256 | 281 | 168 | 1217 |
| PriKG | Biomedicine | PrimeKG | KG | 410 | 205 | 382 | 253 | 1250 |
SciCUEval evaluates LLMs on four key dimensions:
- **Relevant Information Identification**: Ability to filter out irrelevant or noisy information.
- **Information-absence Detection**: Refraining from answering when no correct information is retrieved.
- **Multi-source Information Integration**: Synthesizing data from multiple sources.
- **Context-aware Inference**: Performing logical inference on retrieved data.
## ⚡ Getting Started

- Clone the repository:

  ```bash
  git clone https://github.com/HICAI-ZJU/SciCUEval.git
  cd SciCUEval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Deploy your model as an OpenAI-compatible server (a quick connectivity check is sketched below).
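  Any OpenAI-compatible serving stack (e.g., vLLM) should work. Before running the benchmark, it can help to verify the endpoint with a minimal request; the base URL, key, and model name below are placeholder assumptions for a local deployment:

  ```python
  # Minimal connectivity check for an OpenAI-compatible endpoint.
  # base_url, api_key, and model are placeholders -- substitute your own.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",  # your server's endpoint
      api_key="EMPTY",                      # many local servers accept any key
  )

  resp = client.chat.completions.create(
      model="qwen2.5-7b-instruct",
      messages=[{"role": "user", "content": "Say hello."}],
  )
  print(resp.choices[0].message.content)
  ```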
- Configure the API settings in `run.py`: set `API_KEY` and `API_BASE` for your model.
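  The exact assignments live in `run.py`; the values below are placeholder assumptions matching the local deployment above:

  ```python
  # In run.py -- placeholder values; substitute your own deployment's settings.
  API_KEY = "EMPTY"                      # key expected by your server, if any
  API_BASE = "http://localhost:8000/v1"  # base URL of the OpenAI-compatible server
  ```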
- Run the evaluation:

  ```bash
  python run.py --rag [True/False] --model [MODEL_NAME] --dataset [DATASET] --type [COMPETENCIES]
  ```

  Example:

  ```bash
  python run.py --rag True --model qwen2.5-7b-instruct --dataset MatText --type Context-aware_Inference,Relevant_Information_Identification
  ```
- Generate the results:

  ```bash
  python eval.py
  ```

  Results are saved to `result.json`, `dataset_result.txt`, and `overall_result.txt`.
Evaluation results are saved in the `output` directory, organized by model and dataset for detailed performance analysis:
```text
/output/
├── qwen2.5-7b-instruct-True/
│   ├── MatText/
│   │   ├── Relevant_Information_Identification.json
│   │   ├── Information-absence_Detection.jsonl
│   │   └── ...
│   └── MolTab/
└── ...
```
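For a quick inventory of what a run produced before diving into per-item files, a short script can walk this tree. This is a minimal sketch assuming the layout shown above; it only lists files and does not parse their contents, since the per-item schema is defined by `eval.py`:

```python
# List every model/dataset/competency result file under output/.
# Assumes the directory layout shown above (an assumption, not a guarantee).
from pathlib import Path
from collections import defaultdict

output_dir = Path("output")
summary = defaultdict(list)

for f in sorted(output_dir.glob("*/*/*.json*")):  # matches .json and .jsonl
    model, dataset = f.parts[-3], f.parts[-2]
    summary[(model, dataset)].append(f.stem)      # competency name

for (model, dataset), competencies in sorted(summary.items()):
    print(f"{model} / {dataset}: {', '.join(competencies)}")
```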
## 📝 Cite

If you use SciCUEval in your research, please cite our paper:

```bibtex
@misc{yu2025scicueval,
      title={SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models},
      author={Jing Yu and Yuqi Tang and Kehua Feng and Mingyang Rao and Lei Liang and Zhiqiang Zhang and Mengshu Sun and Wen Zhang and Qiang Zhang and Keyan Ding and Huajun Chen},
      year={2025},
      eprint={2505.15094},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.15094},
}
```