Discovera is an interactive, agent-based system that integrates traditional bioinformatics tools with large language models (LLMs) and retrieval-augmented generation (RAG) to support hypothesis generation and mechanistic discovery in functional genomics.
Presented at ICLR 2025 MLGenX Workshop
🔗 Read the Paper · 🖼️ View the Poster
Discovera bridges the gap between computational analysis and interpretability in biomedical research. It is designed to assist researchers, regardless of their coding expertise, in:
- Interactively exploring gene sets associated with complex phenotypes
- Conducting functional enrichment analyses
- Summarizing mechanistic hypotheses using evidence from literature
- Formulating data-grounded biological insights
Key features:
- ⚙️ Modular System: Combines established tools (e.g., GSEApy, INDRA) with custom modules for extensibility.
- 📖 LLM Integration: Uses LLMs for natural language reasoning and explanation generation.
- 🔎 Retrieval-Augmented Generation: Grounds summaries in real literature to improve accuracy and transparency.
- 💬 Chat Interface: Enables intuitive, dialogue-based exploration of hypotheses and gene set functions.
In our initial deployment, this agent was used in the context of endometrial carcinoma (EC) to:
- Analyze gene sets linked to phenotypic features
- Perform enrichment analysis on the resulting sets
- Summarize literature-supported mechanisms of action
The system consists of:
- Enrichment tools (e.g., GSEApy)
- INDRA for biological statement synthesis
- LLM-enabled prompt orchestration
- Chat-based user interface
Prerequisites:
- Python 3.8+
- Docker & Docker Compose
- OpenAI API key (for LLM integration)
Clone the repository:
git clone https://github.com/VIDA-NYU/discovera.git
cd discovera
Copy the configuration template:
cp .beaker.conf.template .beaker.conf
Open .beaker.conf in a text editor and add your OpenAI API key, if using OpenAI:
api_key = "your-api-key-here"
This configuration will be used when launching Docker.
To use Discovera with the Beaker context setup:
docker compose build
docker compose up -d
Once running, navigate to http://localhost:8888 and select discovera.
Try the following steps to see how BKD-Agent assists in biomedical discovery:
- Load Gene Expression Data
  - Load gene_expression.csv into a Pandas DataFrame.
- Gene Set Enrichment Analysis (GSEA)
  - Use the following parameters:
    - gene_sets: GO_Biological_Process_2023
    - hit_column: hit
    - corr_column: corr
    - min_set: 5
    - max_set: 2000
- Identify Lead Genes from Top Pathway
  - Extract lead genes from the most statistically significant pathway.
- Refine Gene List Based on Correlation
  - Identify the top 20 most correlated genes, excluding those already in the lead gene set.
- Retrieve Documented Gene Relationships
  - Example gene pair: ["CTNNB1", "GLCE"]
- Summarize Gene Pair Relationship Types and Frequencies
  - Example pairs:
    - ["CTNNB1", "GLCE"]
    - ["CTNNB1", "NOTUM"]
- Construct Gene Network Graph
  - Nodes = genes, edges = literature-backed relationships weighted by frequency/strength
- Extract Gene Relationship Excerpts
  - Retrieve textual evidence supporting the relationships.
- Contextualize Gene Relationships in Disease (Endometrial Carcinoma)
  - Analyze how gene interactions relate to the disease context.
  - Use GSEA results, pathway data, and literature to generate hypotheses.
  - Ask:
    - "Can you summarize these excerpts in the context of prospective endometrial carcinoma?"
    - "What hypotheses can be drawn from this analysis?"
- Suggest Future Research Directions
  - Propose:
    - Novel hypotheses
    - Drug targets or pathway interventions
    - Further bioinformatics studies (e.g., single-cell RNA-seq)
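The list-refinement and network-building steps above can be sketched in plain Python. This is a minimal illustration and not the agent's actual implementation. The function names, the choice to rank by absolute correlation, and the `(gene_a, gene_b, relation_type)` tuple format are all assumptions made for this example, and the gene names and values are placeholders rather than real results.

```python
from collections import Counter, defaultdict


def top_correlated_genes(correlations, lead_genes, n=20):
    """Return the n genes with the largest absolute correlation,
    skipping any gene that is already in the lead gene set."""
    excluded = set(lead_genes)
    candidates = [(g, c) for g, c in correlations.items() if g not in excluded]
    # Assumption: "most correlated" means largest |correlation|.
    candidates.sort(key=lambda item: abs(item[1]), reverse=True)
    return [gene for gene, _ in candidates[:n]]


def summarize_relationships(statements):
    """Count relationship types per unordered gene pair.

    statements: iterable of (gene_a, gene_b, relation_type) tuples
    (a hypothetical format chosen for this sketch).
    """
    summary = defaultdict(Counter)
    for gene_a, gene_b, relation in statements:
        pair = tuple(sorted((gene_a, gene_b)))  # treat A-B and B-A as one pair
        summary[pair][relation] += 1
    return summary


def build_gene_graph(summary):
    """Build an adjacency dict where graph[a][b] is the total number
    of statements linking genes a and b (the edge weight)."""
    graph = defaultdict(dict)
    for (gene_a, gene_b), counts in summary.items():
        weight = sum(counts.values())
        graph[gene_a][gene_b] = weight
        graph[gene_b][gene_a] = weight
    return graph


# Placeholder inputs, not real data.
correlations = {"GENE_A": 0.91, "GENE_B": -0.75, "GENE_C": 0.62, "GENE_D": 0.10}
lead_genes = ["GENE_A"]
top = top_correlated_genes(correlations, lead_genes, n=2)

statements = [
    ("GENE_A", "GENE_B", "Activation"),
    ("GENE_B", "GENE_A", "Activation"),
    ("GENE_A", "GENE_C", "IncreaseAmount"),
]
summary = summarize_relationships(statements)
graph = build_gene_graph(summary)
```

Here the edge weight is a simple statement count; a different notion of relationship strength would only change how `weight` is computed in `build_gene_graph`.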
Currently, the agent has one tool, query_gene_pair, defined in src/discovera/agent.py.
To add more tools:
- Copy the format of query_gene_pair.
- Tools are registered with @tool from Archytas.
- Archytas expects arguments as string variable names, not DataFrames directly.
- For example, this will correctly pass a variable and a string:
  query_gene_pair({{ dataset }}, target="{{ target }}", method="{{ method }}")
- The underlying code executed lives in procedures/python3/query_gene_pair.py.
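To see why arguments are passed as variable names, it may help to look at how a template with `{{ name }}` placeholders turns into executable code. The sketch below uses a small regex-based stand-in written for this example; it is not the template engine the project actually uses. The `render` helper and the values `df` and `example_method` are hypothetical.

```python
import re

# Matches placeholders such as {{ dataset }} or {{target}}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, **values):
    """Replace each {{ name }} placeholder with the matching value as text."""
    def substitute(match):
        return str(values[match.group(1)])
    return PLACEHOLDER.sub(substitute, template)


template = 'query_gene_pair({{ dataset }}, target="{{ target }}", method="{{ method }}")'

# "df" is the *name* of a DataFrame variable in the notebook, not the DataFrame itself.
code = render(template, dataset="df", target="GLCE", method="example_method")
print(code)
```

Because `{{ dataset }}` is unquoted in the template, the rendered code refers to the variable `df`, while the quoted placeholders become string literals.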
There are two main areas to adjust the agent's behavior:
- Context Management
  - File: src/discovera/context.py
  - Modify the auto_context function to alter background knowledge or enumerate tools.
- Prompt Customization
  - File: agent.py
  - The main system prompt is written in the BKDAgent docstring.
Use these to tweak how the agent interprets user queries, formats responses, and integrates with tools.
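As a generic illustration of the docstring-as-prompt pattern, the sketch below shows how a class docstring can be read back as a prompt string with the standard library. The `ExampleAgent` class and `system_prompt_for` helper are hypothetical, and how Archytas itself consumes the BKDAgent docstring is not shown here.

```python
import inspect


class ExampleAgent:
    """You are an assistant that helps researchers explore gene sets.

    Explain each step of the analysis before running it.
    """


def system_prompt_for(agent_cls):
    """Return the class docstring with indentation cleaned up, or an empty string."""
    return inspect.getdoc(agent_cls) or ""


prompt = system_prompt_for(ExampleAgent)
print(prompt)
```

Under this pattern, editing the docstring text is all that is needed to change the instructions the agent starts with.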