CellAtria is an agentic AI system that enables full-lifecycle, document-to-analysis automation in single-cell research. It integrates natural language interaction with a robust, graph-based, multi-actor execution framework. The system orchestrates diverse tasks, ranging from literature parsing and metadata extraction to dataset retrieval and downstream scRNA-seq analysis via the co-developed CellExpress pipeline.
Through its comprehensive interface, CellAtria empowers users to engage with a language model augmented by task-specific tools. This eliminates the need for manual command-line operations, accelerating data onboarding and the reuse of public single-cell resources.
Language model-mediated orchestration of toolchains. Upon receiving a user prompt, the CellAtria interface transfers the request to the LLM agent, which interprets intent and autonomously invokes relevant tools. Outputs are returned through the interface, completing a full cycle of context-aware execution.
- Flexible Input: Accepts primary research articles as PDFs or URLs for seamless integration.
- Automated Metadata Extraction: Extracts structured metadata, including sample annotations, organism, tissue type, and GEO (Gene Expression Omnibus) accession identifiers.
- Intelligent Data Retrieval: Resolves and organizes GEO datasets by accessing both GSE (study-level) and GSM (sample-level) records, ensuring structured and comprehensive data retrieval.
- Integrated Analysis Pipeline: Orchestrates full pipeline configuration and launches CellExpress, a containerized framework for standardized scRNA-seq analysis, ensuring reproducible results.
- Enhanced User Control: Enables metadata editing, secure file transfers, and direct file system management within the agent session.
- Modular & Reusable Architecture: Composes all core actions into reusable, graph-based tools that serve as callable agent nodes, fostering extensibility.
Additional details on the underlying toolkits can be found in the toolkit reference.
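To make the accession-handling feature above more concrete, the following minimal sketch pulls GSE and GSM identifiers out of free text with a regular expression. It is not CellAtria's actual implementation: the function name `find_geo_accessions` and the sample sentence are hypothetical, and the accession numbers are made up. The only assumption is that GEO series and sample accessions take the form `GSE` or `GSM` followed by digits.

```python
import re

# GEO series (GSE) and sample (GSM) accessions: a prefix followed by digits.
_GEO_PATTERN = re.compile(r"\b(GSE|GSM)(\d+)\b")


def find_geo_accessions(text: str) -> dict[str, list[str]]:
    """Return unique GSE and GSM accessions in order of first appearance.

    Hypothetical helper for illustration; not part of the CellAtria API.
    """
    found: dict[str, list[str]] = {"GSE": [], "GSM": []}
    for match in _GEO_PATTERN.finditer(text):
        accession = match.group(0)       # e.g. "GSE000001"
        bucket = found[match.group(1)]   # pick the GSE or GSM list
        if accession not in bucket:      # drop repeated mentions
            bucket.append(accession)
    return found


if __name__ == "__main__":
    # Made-up accession numbers, used only to exercise the pattern.
    sample = "Data are available under GSE000001 (samples GSM0000001 and GSM0000002)."
    print(find_geo_accessions(sample))
```

In practice the agent works from parsed article metadata rather than a raw string, so a pattern like this would be at most one small part of the resolution step.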
- Docker: Install Docker and ensure the Docker daemon is running.
- Environment Configuration: Provide a `.env` file with credentials and parameters (see LLM Configuration section below).
The CellAtria repository includes a GitHub Actions workflow that builds and publishes a preconfigured Docker image to the GitHub Container Registry.
Pull the latest CellAtria Docker image using:
# Run this command in your terminal
docker pull ghcr.io/astrazeneca/cellatria:v1.0.0

This image contains all dependencies needed to run the CellAtria agent in a consistent environment.
Start the agent with the following command (replace paths with your actual directories):
# Run this command in your terminal
docker run -it --rm \
-p 7860:7860 \
-v /path/to/your/project/directory:/data \
-v /path/to/your/env/directory:/envdir \
ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria \
--env_path /envdir

Command Breakdown:

- `-p 7860:7860`: Exposes the agent user interface (UI) on port 7860.
- `-v /path/to/your/project/directory:/data`: Mounts your project directory into the container.
- `-v /path/to/your/env/directory:/envdir`: Mounts your `.env` directory for configuration (see LLM Configuration section below).
- `ghcr.io/astrazeneca/cellatria:v1.0.0 cellatria`: Specifies the Docker image and the entrypoint command to launch the app inside the container.
- `--env_path /envdir`: Tells the agent where to find the `.env` file for provider setup.
macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add `--platform=linux/amd64` when running the container (i.e., `docker run --platform=linux/amd64 -it --rm`).
Once launched, the agent will initialize and provide a local URL for interaction. Simply open the link printed in your terminal to begin using CellAtria through your browser.
Mounting a Working Directory:
When running the container, any host directory you want the container to access must be explicitly mounted using Docker’s -v (volume) flag. The container can only see and interact with the directories you specify at runtime.
For example, the flag `-v /absolute/path/on/host:/data` makes the contents of `/absolute/path/on/host` on your host machine available inside the container at `/data`.
If you set a working directory inside the container (e.g., `my_project`), make sure to reference it using the container’s path, for instance `/data/my_project`. Attempting to access files or directories outside the mounted path from within the container will fail, as they are not visible to the container’s filesystem.
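The host-to-container path mapping described above can be sketched in a few lines of Python. This is an illustrative helper under stated assumptions, not something shipped with CellAtria: the function name `to_container_path` is hypothetical, paths are assumed to be absolute, already normalized POSIX paths, and `/data` is the container mount point used in the commands in this README.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, host_mount: str,
                      container_mount: str = "/data") -> str:
    """Translate a host path under `host_mount` to its path inside the container.

    Hypothetical helper: raises ValueError when `host_path` lies outside the
    mounted directory, mirroring the fact that the container cannot see it.
    """
    host = PurePosixPath(host_path)
    mount = PurePosixPath(host_mount)
    try:
        relative = host.relative_to(mount)  # fails if host is not under mount
    except ValueError:
        raise ValueError(
            f"{host_path} is outside the mounted directory {host_mount}"
        ) from None
    return str(PurePosixPath(container_mount) / relative)


if __name__ == "__main__":
    # With `-v /absolute/path/on/host:/data`, this prints the path to give the agent.
    print(to_container_path("/absolute/path/on/host/my_project",
                            "/absolute/path/on/host"))
```

The point of the sketch is the failure case: a path that is not below the mounted directory has no equivalent inside the container, so it is rejected rather than silently rewritten.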
CellAtria requires a .env file to configure access to your chosen LLM provider. Download the template .env and fill in the necessary credentials and parameters. Ensure the directory containing the .env file is mounted into the container.
- `azure`: Azure OpenAI (enterprise-grade access to GPT models)
- `openai`: Standard OpenAI API (e.g., GPT-4, GPT-3.5)
- `anthropic`: Claude models via the Anthropic API
- `google`: Gemini models via Google Cloud / Vertex AI
- `local`: Offline models (e.g., Llama.cpp, Ollama, Hugging Face)
Set the `PROVIDER` variable in your `.env` file to one of the supported values above. Only one provider can be active at a time.
You only need to configure the block for the provider you're using. The rest can remain commented.
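Before launching the container, it can help to check that the `.env` file selects exactly one supported provider. The sketch below is a simplified, standalone check and not how CellAtria itself reads the file: the parser handles only plain `KEY=VALUE` lines, the function names are hypothetical, and `PROVIDER` is the only key assumed here because the provider-specific keys are defined by the template rather than by this section.

```python
# Provider values listed in this README.
SUPPORTED_PROVIDERS = {"azure", "openai", "anthropic", "google", "local"}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # commented-out provider blocks are ignored
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")  # drop optional quotes
    return values


def active_provider(env: dict[str, str]) -> str:
    """Return the configured provider, or raise if it is missing or unknown."""
    provider = env.get("PROVIDER", "")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"PROVIDER must be one of {sorted(SUPPORTED_PROVIDERS)}, got {provider!r}"
        )
    return provider


if __name__ == "__main__":
    example = "# PROVIDER=azure\nPROVIDER=openai\n"
    print(active_provider(parse_env(example)))
```

Because commented lines are skipped, leaving the unused provider blocks commented, as recommended above, does not interfere with the check.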
CellExpress is a companion pipeline embedded within the CellAtria framework. It delivers a reproducible and automated workflow for processing single-cell RNA-seq datasets (scRNA-seq) - from raw count matrices to comprehensive cell type annotations and report generation.
Designed to lower bioinformatics barriers, CellExpress implements a comprehensive set of state-of-the-art, Scanpy-based processing stages, including quality control (performed globally or per sample), data transformation (including normalization, highly variable gene selection, and scaling), dimensionality reduction (UMAP and t-SNE), graph-based clustering, and marker gene identification. Additional tools are integrated to support advanced analysis tasks, including doublet detection, batch correction, and automated cell type annotation using both tissue-agnostic and tissue-specific models. All analytical steps are executed sequentially under centralized control, with parameters fully configurable via a comprehensive input schema.
CellExpress is a fully standalone pipeline for comprehensive scRNA-seq data analysis. It can be orchestrated either through an agentic system - as incorporated into the CellAtria framework - or via direct command-line execution.
To execute the CellExpress pipeline directly using Docker, use the following command:
# Run this command in your terminal
docker run -it --rm \
-v /path/to/your/local/data:/data \
ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress \
--input /data \
--project your_project_name \
--species your_species \
--tissue your_tissue \
--disease your_disease \
[--additional `options`...]

Command Breakdown:

- `-v /path/to/your/local/data:/data`: Mounts your project directory into the container.
- `ghcr.io/astrazeneca/cellatria:v1.0.0 cellexpress`: Specifies the Docker image and the entrypoint command to launch CellExpress inside the container.
- `[--additional options...]`: Additional arguments to configure the pipeline.
macOS users with Apple Silicon (M1/M2): You may encounter a warning due to platform mismatch. To ensure compatibility, add `--platform=linux/amd64` when running the container (i.e., `docker run --platform=linux/amd64 -it --rm`).
For full details, usage instructions, and configuration options, refer to the CellExpress README.
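For scripted runs, the direct CellExpress invocation shown in this section can be assembled programmatically. The sketch below only builds the argument list from the flags documented here (`--input`, `--project`, `--species`, `--tissue`, `--disease`) and the optional `--platform=linux/amd64` flag. The function name `build_cellexpress_command` and all argument values are placeholders, and the accepted values for each flag are defined in the CellExpress README rather than here.

```python
import shlex

# Image tag as used in the commands in this README.
IMAGE = "ghcr.io/astrazeneca/cellatria:v1.0.0"


def build_cellexpress_command(host_data_dir: str, project: str, species: str,
                              tissue: str, disease: str,
                              amd64: bool = False) -> list[str]:
    """Assemble the `docker run ... cellexpress` argument list (hypothetical helper)."""
    cmd = ["docker", "run"]
    if amd64:
        # Suggested above for Apple Silicon hosts to avoid a platform mismatch.
        cmd.append("--platform=linux/amd64")
    cmd += [
        "-it", "--rm",
        "-v", f"{host_data_dir}:/data",  # mount the host data directory at /data
        IMAGE, "cellexpress",
        "--input", "/data",
        "--project", project,
        "--species", species,
        "--tissue", tissue,
        "--disease", disease,
    ]
    return cmd


if __name__ == "__main__":
    cmd = build_cellexpress_command(
        "/path/to/your/local/data", "your_project_name",
        "your_species", "your_tissue", "your_disease",
    )
    # Print a shell-safe rendering; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems when a directory or project name contains spaces.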
The Dockerfile defines the dedicated computing environment for executing CellAtria and the co-developed CellExpress pipeline in a consistent and reproducible manner.
It includes all required Python and R dependencies, along with support for HTML reporting and visualization.
Built on an Ubuntu-based system, the environment also provides essential system-level packages to support end-to-end pipeline execution.
While CellAtria supports flexible, user-driven interactions, its functionality is governed by an underlying execution narrative — a structured flow of modular actions that define how tasks are interpreted, routed, and executed. Users may invoke any module independently; however, for optimal results and seamless orchestration, we recommend following the intended workflow trajectory below.
CellAtria's internal logic integrates the following key stages:
- Document Parsing - Extracts structured metadata from narrative-formatted scientific documents (article URL or PDF).
- Accession Resolution - Identifies relevant GEO (Gene Expression Omnibus) accession IDs from parsed metadata.
- Dataset Retrieval - Downloads datasets directly from public repositories.
- File & Data Organization - Structures downloaded content into a consistent directory schema for analysis.
- Pipeline Configuration - Prepares CellExpress arguments and environmental parameters for execution.
- CellExpress Execution - Launches the standardized single-cell analysis pipeline in a detached mode.
This modular, agent-guided framework allows users to begin at any point while preserving logical consistency across steps.
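The recommended trajectory above can be pictured as an ordered sequence that a session may enter at any stage. The sketch below is a deliberately simplified model of that idea, not CellAtria's internal graph: the stage identifiers and the `stages_from` function are hypothetical names derived from the list above.

```python
# Stage identifiers mirroring the workflow list in this README (names are illustrative).
STAGES = [
    "document_parsing",
    "accession_resolution",
    "dataset_retrieval",
    "file_and_data_organization",
    "pipeline_configuration",
    "cellexpress_execution",
]


def stages_from(start: str) -> list[str]:
    """Return the stages that follow the recommended order, beginning at `start`."""
    if start not in STAGES:
        raise ValueError(f"unknown stage: {start!r}")
    return STAGES[STAGES.index(start):]


if __name__ == "__main__":
    # A user who already has the data on disk might start at organization.
    for stage in stages_from("file_and_data_organization"):
        print(stage)
```

A linear list is the simplest structure that preserves the ordering. The real system is described as graph-based, so its routing between tools is presumably richer than this.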
If you use this repository, please cite:
Nima Nouri, et al. (2025). An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis. bioRxiv. https://doi.org/10.1101/2025.07.31.667880
@article{nouri2025agentic,
title={An Agentic AI Framework for Ingestion and Standardization of Single-Cell RNA-seq Data Analysis},
author={Nouri, Nima and Artzi, Ronen and Savova, Virginia},
journal={bioRxiv},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
| Role | Name | Contact |
|---|---|---|
| Author/Maintainer | Nima Nouri | [email protected] |

