SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models

📖 Paper | ⌚️ Overview | 📊 Datasets | ⚡ Getting Started | 📝 Cite

📌 Contents

  • 🆕 News
  • ⌚️ Overview
  • 📊 Datasets
  • ⚙️ Evaluation Competencies
  • ⚡ Getting Started
  • 📊 Results
  • 📝 Citation

🆕 News

  • [2025.05]: SciCUEval is released as a comprehensive dataset for evaluating LLMs' context understanding capability in diverse scientific domains.

⌚️ Overview

SciCUEval is a benchmark designed to evaluate the context understanding capabilities of Large Language Models (LLMs) in scientific domains. It addresses the lack of specialized benchmarks in this area by providing:

  • 10 Diverse Datasets: spanning biology, chemistry, physics, biomedicine, and materials science.

  • Multiple Data Modalities: structured tables, knowledge graphs, and unstructured text.

  • Four Core Competencies: relevant information identification, information-absence detection, multi-source information integration, and context-aware inference.

  • Comprehensive Evaluation: assessing state-of-the-art LLMs across various scientific tasks.

📊 Datasets

SciCUEval includes ten sub-datasets spanning multiple scientific domains and data modalities:

| Sub-dataset | Domain | Source | Modality | # Info. Ident. | # Abs. Detec. | # Info. Integ. | # Con. Infer. | # Total |
|---|---|---|---|---|---|---|---|---|
| MatText | Materials | arXiv | Text | 216 | 146 | 222 | 356 | 940 |
| BioText | Biology | bioRxiv | Text | 236 | 97 | 318 | 317 | 968 |
| MatTab | Materials | Materials Project | Table | 299 | 150 | 287 | 200 | 936 |
| IaeaTab | Physics | IAEA | Table | 442 | 222 | 286 | 180 | 1130 |
| ProtTab | Biology | PubChem | Table | 496 | 249 | 327 | 180 | 1252 |
| MolTab | Chemistry | PubChem | Table | 516 | 259 | 350 | 180 | 1305 |
| GoKG | Biology | Gene Ontology | KG | 507 | 254 | 239 | 180 | 1180 |
| HipKG | Biology | HIPPIE | KG | 470 | 236 | 319 | 140 | 1165 |
| PhaKG | Biomedicine | PharmKG | KG | 512 | 256 | 281 | 168 | 1217 |
| PriKG | Biomedicine | PrimeKG | KG | 410 | 205 | 382 | 253 | 1250 |

The four count columns give the number of instances per competency: relevant information identification (Info. Ident.), information-absence detection (Abs. Detec.), multi-source information integration (Info. Integ.), and context-aware inference (Con. Infer.).

⚙️ Evaluation Competencies

SciCUEval evaluates LLMs on four key dimensions:

  1. Relevant Information Identification: Ability to filter out irrelevant or noisy information.

  2. Information-absence Detection: Refraining from answering when no correct information is retrieved.

  3. Multi-source Information Integration: Synthesizing data from multiple sources.

  4. Context-aware Inference: Performing logical inference on retrieved data.
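
To make these dimensions concrete, the sketch below shows what instances targeting two of the competencies might look like. It is purely illustrative: the field names and contents are hypothetical and do not reflect the dataset's actual schema.

```python
# Hypothetical instances -- the real SciCUEval schema may differ.
examples = [
    {
        "competency": "Relevant_Information_Identification",
        "context": "Passage A (relevant) ... Passage B (distractor) ...",
        "question": "Using only the relevant passage, what is the bandgap of material X?",
    },
    {
        "competency": "Information-absence_Detection",
        "context": "Passages that do not contain the requested value.",
        "question": "What is the binding affinity of compound Y?",
        "expected_behavior": "Decline to answer; state that the context lacks the information.",
    },
]
```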

⚡ Getting Started

🔧 Installation

  1. Clone the repository:

git clone https://github.com/HICAI-ZJU/SciCUEval.git
cd SciCUEval

  2. Install dependencies:

pip install -r requirements.txt

🚀 Running Evaluations

  1. Deploy your model as an OpenAI-compatible server.

  2. Configure API settings in run.py:

  • Set API_KEY and API_BASE for your model.
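
For reference, here is a minimal sketch of what this configuration might look like, assuming run.py uses the standard openai Python client; the variable names come from this README, but the client usage shown is an assumption:

```python
# Hypothetical excerpt -- adapt to the actual structure of run.py.
from openai import OpenAI

API_KEY = "sk-..."                     # key for your deployed model server
API_BASE = "http://localhost:8000/v1"  # base URL of the OpenAI-compatible server

client = OpenAI(api_key=API_KEY, base_url=API_BASE)
```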
  3. Run the evaluation:
python run.py --rag [True/False] --model [MODEL_NAME] --dataset [DATASET] --type [COMPETENCIES]

Example:

python run.py --rag True --model qwen2.5-7b-instruct --dataset MatText --type Context-aware_Inference,Relevant_Information_Identification
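
To sweep several sub-datasets in one go, a small wrapper such as the following may help. Only CLI flags documented in this README are used; the model name and dataset list are illustrative:

```python
# Illustrative batch runner around run.py; flags are taken from this README.
import subprocess

datasets = ["MatText", "BioText", "MolTab"]  # any sub-datasets from the table above
competencies = "Relevant_Information_Identification,Information-absence_Detection"

for ds in datasets:
    subprocess.run(
        ["python", "run.py",
         "--rag", "True",
         "--model", "qwen2.5-7b-instruct",
         "--dataset", ds,
         "--type", competencies],
        check=True,  # stop on the first failing run
    )
```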
  4. Generate results:
python eval.py

Results are saved in result.json, dataset_result.txt, and overall_result.txt.

📊 Results

Evaluation results are saved in the output directory, organized by model and dataset for detailed performance analysis:

/output/
  ├── qwen2.5-7b-instruct-True/
  │    ├── MatText
  │    │    ├── Relevant_Information_Identification.jsonl
  │    │    ├── Information-absence_Detection.jsonl
  │    │    └── ...
  │    └── MolTab
  └── ...
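
To inspect the per-competency outputs programmatically, a minimal sketch along these lines should work, assuming each .jsonl file holds one JSON record per line (the directory layout follows the tree above; the record fields themselves are not documented here):

```python
# Walk the output tree and count records in each per-competency file.
import json
from pathlib import Path

for path in sorted(Path("output").rglob("*.jsonl")):
    with path.open() as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} records")
```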

📝 Citation

If you use SciCUEval in your research, please cite our paper:

@misc{yu2025scicueval,
      title={SciCUEval: A Comprehensive Dataset for Evaluating Scientific Context Understanding in Large Language Models}, 
      author={Jing Yu and Yuqi Tang and Kehua Feng and Mingyang Rao and Lei Liang and Zhiqiang Zhang and Mengshu Sun and Wen Zhang and Qiang Zhang and Keyan Ding and Huajun Chen},
      year={2025},
      eprint={2505.15094},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.15094}, 
}
