Single-Cell Omics Arena (SOAR) is a comprehensive benchmark framework designed to evaluate and improve the performance of instruction-tuned large language models (LLMs) in automated cell type annotation from single-cell omics data.
- [2025-08-03] 🎉 We are excited to provide a command line interface and the pre-built package `soar_benchmark` for easier usage.
- [2025-05-12] 🎉 We are excited to announce that Single-Cell Omics Arena (SOAR) is now open-source! We welcome contributions from the community to help advance automated cell type annotation using LLMs.
- Create an environment with Python >= 3.11
- Clone the repo via `git clone git@github.com:jhliu17/SOAR.git`
- Install `soar_benchmark` via `pip install -e .`
To execute LLMs provided by OpenAI or hosted on Hugging Face Transformers, an env file (`env.toml`) should be set up in the project folder. A template env file is provided as `env_sample.toml`.
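The keys expected inside `env.toml` (for example, API credentials) are defined by the template itself; the snippet below is only a hedged convenience sketch, assuming both files sit in the project root, for copying the template and checking that the result is valid TOML:

```python
# Minimal sketch (not part of SOAR itself): copy the template and make sure
# the resulting env.toml parses before running any annotation command.
import shutil
import tomllib  # standard library on Python >= 3.11
from pathlib import Path

env_file = Path("env.toml")
if not env_file.exists():
    shutil.copy("env_sample.toml", env_file)  # start from the provided template

with env_file.open("rb") as f:
    settings = tomllib.load(f)  # raises tomllib.TOMLDecodeError on malformed files

print(f"Loaded {len(settings)} top-level settings from {env_file}")
```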
To print out all supported LLMs, please run:

```bash
soar annotate -h
```

This will list all available LLM annotation options.
`soar` currently supports the following LLMs for cell type annotation:
- Qwen2 series (1.5B, 7B, 72B)
- Meta Llama-3 70B
- Mixtral-8x7B
- GPT-4o
- GPT-4o-mini
All of these models can perform cell type annotation with either zero-shot prompting or zero-shot chain-of-thought (CoT) prompting.
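As a rough illustration of the two prompting modes (not SOAR's actual prompt templates; the marker genes and instruction wording below are made-up placeholders), zero-shot CoT typically differs from plain zero-shot prompting only by an appended reasoning trigger:

```python
# Illustrative only: these are not the prompts shipped with SOAR.
marker_genes = ["CD3D", "CD3E", "IL7R"]  # hypothetical top marker genes for one cluster

base_prompt = (
    "Identify the cell type of a human PBMC cluster with the following marker genes: "
    + ", ".join(marker_genes)
    + "."
)

zero_shot_prompt = base_prompt
zero_shot_cot_prompt = base_prompt + " Let's think step by step."  # standard zero-shot CoT trigger

print(zero_shot_prompt)
print(zero_shot_cot_prompt)
```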
Each model has specific hardware requirements and configurations (a quick local GPU check is sketched right after this list):
- Smaller models (1.5B-7B): Single GPU with 16GB VRAM
- Larger models (70B-72B): Multi-GPU setup with 4-8 GPUs
- Mixtral-8x7B: 4 GPUs recommended for optimal performance
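To verify that a local machine meets these rough requirements before launching one of the Hugging Face-hosted models, a small PyTorch snippet (an assumption about tooling; it is not part of SOAR) can report the detected GPUs and their VRAM:

```python
# Hedged sketch: check the local GPU count and VRAM against the guidance above.
# Requires PyTorch, which the Hugging Face Transformers backends need anyway.
import torch

if not torch.cuda.is_available():
    print("No CUDA GPUs detected; only API-based models (e.g. GPT-4o) will be practical.")
else:
    n_gpus = torch.cuda.device_count()
    for i in range(n_gpus):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
    print(f"{n_gpus} GPU(s) total; 70B-class models typically need 4-8 of them.")
```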
For example, to reproduce the SOAR-RNA benchmark result for GPT-4o with zero-shot prompting, run:

```bash
soar annotate soar_rna_with_gpt4_o_zero_shot
```

To apply a provided LLM annotation configuration to your own dataset, pass the dataset path:

```bash
soar annotate soar_rna_with_gpt4_o_zero_shot --config.dataset.json-path YOUR_DATASET_PATH
```

where the custom dataset should follow the same structure as `soar_benchmark/datasets/soar_rna.json`.

One can further fine-tune a preset configuration by overriding individual arguments, for example increasing the limit on newly generated tokens to 2048:

```bash
soar annotate soar_rna_with_gpt4_o_zero_shot --config.generation.max-new-tokens 2048

# To see more tunable options
soar annotate soar_rna_with_gpt4_o_zero_shot -h
```

If you would like to implement a custom annotation configuration, please refer to the detailed configuration settings (including batch sizes, memory requirements, and hardware specifications) in:
- Model configs: `soar_benchmark/configs/cell_type_annotation/experiment_soar_rna.py`
Once you have implemented a custom configuration, you can use it by calling the built-in annotation function:

```python
from soar_benchmark import start_annotation_task

# Your custom configuration
custom_configuration = CellTypeAnnotationTaskConfig(...)

# Start annotation
start_annotation_task(custom_configuration)
```

To run evaluations on annotated results, please refer to:
```bash
python -m analysis.cell_type_annotation.squad_eval \
    --chat_results_path outputs/.../qwen2-72b-instruct.json \
    --squad_eval_results_path outputs/.../few_shot_squad_eval_inflect.json
```

To evaluate the free-format cell type annotations generated by LLMs, we employ seven widely used metrics from natural language processing and question answering: ROUGE (R-1, R-2, R-L) for n-gram and sequence overlap; METEOR for semantic similarity via surface forms, stems, and synonyms; and BLEU (BLEU-1, BLEU-2, and their geometric average) for n-gram overlap, which is particularly suited to short phrases. In addition, Exact Match (EM) and F1 are used to assess token-level precision and recall, ensuring fair evaluation despite label variability and synonym usage.
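SOAR's own evaluation lives in the script above; purely as an illustration, the sketch below shows how the same metric families could be computed with the Hugging Face `evaluate` package (an assumed external tool, not necessarily what the script uses) on toy predictions and references:

```python
# Hedged sketch (pip install evaluate rouge_score nltk): not SOAR's evaluation code,
# just the same metric families applied to toy cell type labels.
import evaluate

predictions = ["cd8+ t cell", "b cell"]      # hypothetical LLM annotations
references = ["cytotoxic t cell", "b cell"]  # hypothetical ground-truth labels

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bleu = evaluate.load("bleu")

print(rouge.compute(predictions=predictions, references=references))   # rouge1 / rouge2 / rougeL
print(meteor.compute(predictions=predictions, references=references))  # meteor
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references],
                   max_order=2))                                        # BLEU up to 2-grams

# Exact Match and F1 in the SQuAD style (token-level precision/recall)
squad = evaluate.load("squad")
squad_preds = [{"id": str(i), "prediction_text": p} for i, p in enumerate(predictions)]
squad_refs = [
    {"id": str(i), "answers": {"text": [r], "answer_start": [0]}}
    for i, r in enumerate(references)
]
print(squad.compute(predictions=squad_preds, references=squad_refs))    # exact_match, f1
```

The tables below report the ROUGE, METEOR, and BLEU scores for the baseline annotation tools and LLMs evaluated in SOAR.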
| Model | R-1 | R-2 | R-L | MET. | B-1 | B-2 | BLEU |
|---|---|---|---|---|---|---|---|
| CellMarker2.0 | 31.88 | 13.36 | 31.76 | 23.83 | 41.07 | 18.05 | 27.23 |
| SingleR | 16.51 | 5.98 | 16.49 | 2.96 | 24.41 | 0.00 | 0.00 |
| ScType | 12.37 | 3.44 | 12.24 | 20.18 | 21.47 | 6.73 | 10.77 |
| DeepSeek-LLM-67B | 33.13 | 13.47 | 32.74 | 24.27 | 28.27 | 10.07 | 16.87 |
| Qwen2-72B | 32.39 | 14.76 | 32.05 | 29.96 | 18.59 | 6.67 | 11.13 |
| Llama-3-70B | 30.16 | 13.45 | 29.83 | 27.35 | 22.31 | 8.85 | 14.33 |
| Mixtral-8×7B | 20.95 | 13.40 | 20.78 | 16.94 | 17.61 | 7.16 | 10.23 |
| Mixtral-8×22B | 39.85 | 18.19 | 39.95 | 28.60 | 42.06 | 19.40 | 29.18 |
| Cell2Sentence | 26.87 | 11.48 | 26.76 | 19.45 | 25.24 | 11.79 | 17.25 |
| GPT-4o mini | 52.63 | 27.45 | 52.26 | 41.08 | 45.74 | 23.29 | 32.64 |
| GPT-4o | 58.45 | 32.07 | 58.12 | 45.39 | 62.85 | 42.68 | 51.79 |
| Model | R-1 | R-2 | R-L | MET. | B-1 | B-2 | BLEU |
|---|---|---|---|---|---|---|---|
| CellMarker2.0 | - | - | - | - | - | - | - |
| SingleR | - | - | - | - | - | - | - |
| ScType | - | - | - | - | - | - | - |
| DeepSeek-LLM-67B | 40.79 | 17.50 | 40.47 | 31.13 | 33.72 | 13.10 | 21.02 |
| Qwen2-72B | 46.56 | 23.93 | 46.34 | 37.09 | 36.85 | 17.92 | 25.69 |
| Llama-3-70B | 42.25 | 21.24 | 42.02 | 34.09 | 25.94 | 11.64 | 17.38 |
| Mixtral-8×7B | 42.37 | 21.37 | 41.82 | 35.45 | 31.57 | 13.83 | 20.90 |
| Mixtral-8×22B | 51.65 | 26.73 | 51.26 | 41.97 | 40.96 | 19.40 | 28.19 |
| Cell2Sentence | - | - | - | - | - | - | - |
| GPT-4o mini | 51.63 | 26.60 | 51.17 | 40.84 | 50.29 | 27.89 | 37.45 |
| GPT-4o | 57.67 | 31.55 | 57.34 | 45.36 | 55.27 | 32.15 | 42.15 |
| Model | R-1 | R-2 | R-L | MET. | B-1 | B-2 | BLEU |
|---|---|---|---|---|---|---|---|
| Qwen2-72B | 21.83 | 6.02 | 20.57 | 11.54 | 20.77 | 5.32 | 10.51 |
| Llama-3-70B | 27.41 | 11.71 | 27.55 | 17.73 | 27.41 | 10.10 | 16.64 |
| Mixtral-8×7B | 33.41 | 18.28 | 33.65 | 26.11 | 33.09 | 18.45 | 24.71 |
| Mixtral-8×22B | 27.66 | 11.29 | 27.67 | 16.27 | 30.63 | 8.06 | 12.90 |
| Cell2Sentence | 28.03 | 18.29 | 28.42 | 20.63 | 41.35 | 35.29 | 38.20 |
| GPT-4o mini | 39.63 | 21.28 | 39.26 | 29.92 | 37.40 | 21.05 | 28.06 |
| GPT-4o | 41.00 | 23.10 | 41.37 | 30.20 | 43.75 | 26.32 | 33.93 |
| Model | R-1 | R-2 | R-L | MET. | B-1 | B-2 | BLEU |
|---|---|---|---|---|---|---|---|
| Qwen2-72B | 19.55 | 8.64 | 18.43 | 11.80 | 16.07 | 3.79 | 7.80 |
| Llama-3-70B | 31.76 | 13.01 | 31.83 | 19.39 | 29.84 | 10.23 | 17.47 |
| Mixtral-8×7B | 29.71 | 14.95 | 29.36 | 24.21 | 23.08 | 9.77 | 15.02 |
| Mixtral-8×22B | 30.38 | 13.13 | 30.22 | 19.39 | 24.08 | 8.33 | 12.99 |
| Cell2Sentence | 26.54 | 10.07 | 26.45 | 17.29 | 36.04 | 22.67 | 28.58 |
| GPT-4o mini | 38.15 | 17.01 | 37.06 | 28.08 | 35.46 | 17.14 | 24.66 |
| GPT-4o | 38.47 | 16.88 | 38.31 | 25.23 | 38.84 | 21.18 | 28.68 |