
πŸ“Š PIC-SURE Chatbot – Metadata Interaction Interface

This project provides an intelligent chatbot interface to interact with the PIC-SURE API, designed to assist researchers in exploring large-scale clinical and genomic datasets.

It operates without Retrieval-Augmented Generation (RAG), relying instead on structured metadata and natural language understanding through Large Language Models (LLMs) powered by Amazon Bedrock.

Through simple, conversational prompts, users can ask:

  • How many participants match a set of conditions,
  • What the distribution of a variable looks like,
  • Or what variables are available for a given research topic.

🎯 Objective

To simplify access to clinical and genomic metadata hosted in the PIC-SURE ecosystem by enabling a natural language workflow.
This chatbot transforms unstructured research questions into structured API queries β€” making metadata navigation faster, more accessible, and LLM-augmented.


🧠 Key Features

  • Intent: count
    Returns the number of participants matching filters extracted from the question.

  • Intent: distribution
    Builds and visualizes a histogram for a selected continuous variable, with optional filters.

  • Intent: information
    Summarizes available datasets or variables, often grouped by relevance or concept.

  • Multi-turn conversation support
    Maintains user context to allow follow-up questions and refinement.

  • Metadata-only focus
    Uses only PIC-SURE’s metadata endpoints (e.g., /concepts, /aggregate);
    HPDS-secured, patient-level endpoints are not included for now.


πŸ™Œ How to Use?

This section explains how to set up and run the PIC-SURE Chatbot locally.

πŸ”§ Option 1: Local Setup with Python

1. Clone the Repository

git clone https://github.com/hms-dbmi/pic-sure-chatbot.git
cd pic-sure-chatbot/back

2. Set Up a Python Virtual Environment

We recommend using venv to isolate dependencies.

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

3. Install Requirements

All dependencies are listed in requirements.txt.
Make sure this file exists at the root of back/.

pip install -r requirements.txt

4. Configure Access Credentials

Open the confidential.py file (already included in back/) and make sure it contains:

PICSURE_API_URL = "https://..."
PICSURE_TOKEN = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."  # Secure API token

You can obtain your token via the PIC-SURE platform or from your institution.

5. Run the Chatbot

  • πŸ” AWS Authentication (required) This project requires access to Amazon Bedrock via AWS SSO.
    • Make sure you are logged in:
      aws sso login --profile nhanes-dev
    • Set your AWS profile for the current terminal session:
      export AWS_PROFILE=nhanes-dev

Use the pipeline entry point to start a conversational session:

python pipeline.py

🐳 Option 2: Using Docker (Recommended for Reproducibility)

This method lets you run the chatbot in a fully isolated environment without installing Python locally.

1. Build the Docker Image

From the root of the repository:

docker build -t pic-sure-chatbot .

2. Run the Chatbot in a Container

docker run -it \
  -v ~/.aws:/root/.aws \
  -e AWS_PROFILE=nhanes-dev \
  pic-sure-chatbot

You can also access a bash shell inside the container for development:

docker run -it pic-sure-chatbot /bin/bash

βœ… Environment Recap

  • Python β‰₯ 3.9 recommended
  • Dependencies: boto3, requests, matplotlib, pandas, pyyaml, etc.
  • Docker alternative available (Dockerfile included)
  • Amazon credentials must be set via AWS CLI or ~/.aws/credentials for Bedrock access

Need help? Reach out via Issues or email the author.


πŸ—‚ Project Structure

back/
β”œβ”€β”€ plots/                  # Contains generated histogram images
β”œβ”€β”€ prompts/                # YAML prompt templates used for LLM calls
β”œβ”€β”€ utils/                  # Core logic for each chatbot intent
β”‚   β”œβ”€β”€ count.py              # Filter extraction and count query execution
β”‚   β”œβ”€β”€ distribution.py       # Distribution variable selection and plotting
β”‚   β”œβ”€β”€ extract_metadata.py   # Metadata parsing (intent, search terms, dataset)
β”‚   β”œβ”€β”€ information.py        # Natural answer generation from metadata
β”‚   └── llm.py                # Core LLM call logic and utilities
β”œβ”€β”€ context.py              # Fetches available datasets from the PIC-SURE API
β”œβ”€β”€ confidential.py         # Stores PIC-SURE token and API URL
β”œβ”€β”€ pipeline.py             # Main chatbot execution pipeline

🧱 How It Works – Chatbot Pipeline

The chatbot follows a three-step logic, depending on the user's intent.

USER QUESTION
   β”‚
   β–Ό
[Step 1] Metadata Extraction (via LLM)
   └── extract_metadata.py
       β”œβ”€β”€ intent: "count", "distribution", or "metadata"
       β”œβ”€β”€ search_terms (keywords)
       β”œβ”€β”€ dataset (if mentioned)
       └── variable type (categorical, continuous, or both)

[Step 2] Intent-specific resolution
   β”œβ”€β”€ count.py β†’ Extracts filters + sends COUNT query to PIC-SURE API
   β”œβ”€β”€ distribution.py β†’ Selects one variable + filters + fields β†’ API call β†’ DataFrame β†’ plot
   └── information.py β†’ Returns a natural-language summary answer (no API call)

[Step 3] Response generation
   β”œβ”€β”€ count/distribution β†’ Structured message with number/plot
   └── information β†’ Direct LLM answer returned as-is

The process is designed to be modular, so each intent type has its own logic block, input format, and output structure.
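
As a rough illustration, the dispatch can be pictured as a small routing function. The entry points below (extract_query_metadata, count_from_metadata, extract_distribution_dataframe, plot_distribution, information_from_metadata) are the real ones documented later in this README; the glue code around them is a hypothetical sketch, not the actual pipeline.py:

# Hypothetical routing sketch; the real pipeline.py may differ.
from utils.extract_metadata import extract_query_metadata
from utils.count import count_from_metadata
from utils.distribution import extract_distribution_dataframe, plot_distribution
from utils.information import information_from_metadata

def answer(user_question, variables, token, api_url, previous_interaction, chat_history):
    # Step 1: LLM extracts intent, search terms, dataset, and variable type.
    # (previous_extracted_metadata omitted for brevity; see extract_metadata.py below.)
    metadata = extract_query_metadata(user_question, previous_interaction=previous_interaction)

    # Steps 2 and 3: route to the intent-specific module and build the response.
    if metadata["intent"] == "count":
        count = count_from_metadata(user_question, metadata, variables,
                                    token, api_url, previous_interaction)
        return f"{count} participants match your criteria."
    if metadata["intent"] == "distribution":
        df, filter_desc = extract_distribution_dataframe(user_question, metadata, variables,
                                                         token, api_url, previous_interaction)
        filename = plot_distribution(df, filter_description=filter_desc)
        return f"Histogram saved to {filename}."
    # intent == "metadata": direct LLM answer built from the full chat history.
    return information_from_metadata(user_question, variables, list(chat_history))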


πŸ“€ Prompt Templates

All prompts sent to the LLMs are defined in a central YAML file:

back/prompts/base_prompts.yml

Each key corresponds to a specific processing step or chatbot intent.

🧾 Prompt Sections

  • metadata_extraction: parses the question to extract intent, dataset, and search terms
  • count_filter_extraction: identifies exact filters and values from metadata
  • distribution_extraction: selects a continuous variable, its filters, and the involved fields
  • information_response: generates a natural-language summary using available variables

🧩 Variable injection

{user_question} is injected into every prompt. {variable_context} is injected into:

  • count_filter_extraction
  • distribution_extraction
  • information_response
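
Loading and injecting these templates can be as simple as yaml.safe_load plus string substitution. The helper below is a hypothetical sketch; only the file path and key names come from this README:

import yaml

# Hypothetical helper: load a prompt template by key and substitute the
# {user_question} / {variable_context} placeholders. Plain str.replace is
# used instead of str.format so that literal braces in embedded JSON
# examples survive untouched.
def build_prompt(key: str, **variables) -> str:
    with open("back/prompts/base_prompts.yml", encoding="utf-8") as f:
        templates = yaml.safe_load(f)
    prompt = templates[key]
    for name, value in variables.items():
        prompt = prompt.replace("{" + name + "}", str(value))
    return prompt

# Example: prepare the count_filter_extraction prompt.
prompt = build_prompt(
    "count_filter_extraction",
    user_question="How many women over 60 in Synthea?",
    variable_context="...",  # serialized variable metadata goes here
)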

πŸ’‘ Prompt Design Highlights

  • All prompts return strict JSON for structured parsing (except information_response)
  • Prompts contain:
    • Domain-specific instructions (e.g., how to infer "morbid obesity")
    • Dataset coherence constraints (no mixing datasets)
    • Fallback behavior if no result is found

πŸ”§ Suggested Improvements

  • Let the LLM return nb_bins in distribution_extraction for histogram granularity (currently fixed at 20)
  • Add optional filter_description for labeling plots or responses
  • Return flags like "uncertainty": true to detect edge-case answers
  • Auto-inject {datasets} dynamically (using context.py)

πŸ’¬ Conversational Behavior

The chatbot supports multi-turn interactions, which allows users to refine questions or continue exploring a topic without repeating themselves.

Context is passed at two levels of granularity, giving each intent the right balance between memory and performance.

chat.previous_interaction

Used in:

  • extract_metadata.py
  • count.py
  • distribution.py

β†’ Only the last exchange is passed to keep prompts lightweight and context-specific.

chat.chat_history

Used only in:

  • information.py

β†’ The entire history is passed to the LLM, since:

  • Information questions tend to be vague or exploratory
  • A broader context improves relevance and reduces ambiguity
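
A minimal sketch of this two-tier memory (hypothetical; the actual chat object in pipeline.py may differ):

from collections import deque

# Hypothetical chat-state sketch: one slot for the last exchange, plus a
# bounded full history used only by information-style questions.
class ChatState:
    def __init__(self, max_turns: int = 50):
        self.previous_interaction = None             # (last_user_msg, last_bot_msg)
        self.chat_history = deque(maxlen=max_turns)  # all (user, bot) pairs

    def record(self, user_msg: str, bot_msg: str) -> None:
        self.previous_interaction = (user_msg, bot_msg)
        self.chat_history.append((user_msg, bot_msg))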

πŸ” Module-by-Module Breakdown

This section describes the role of each key Python script and its logic.

extract_metadata.py

metadata = extract_query_metadata(
    user_question,
    previous_interaction=(previous_user, previous_bot),
    previous_extracted_metadata=self.previous_extracted_metadata
)
  • Role: Handles Step 1 of the chatbot: infers intent, search_terms, dataset, and optionally type from the user’s question.

  • LLM Prompt: metadata_extraction

  • Behavior:

    • May fall back to general exploration mode if input is vague
    • Uses previous_interaction as context
    • Also passes previous extracted metadata (search terms, etc.)
      β†’ This improves consistency across multi-turn queries about the same topic.
  • Output example:

{
  "intent": "count",
  "search_terms": ["sex", "gender", "age", "year"],
  "dataset": "Synthea",
  "type": null
}
  • Notes:
    • type is currently unused, but extracted for potential filtering logic in the future.
    • Called before any API request is made.

count.py

count_result = count_from_metadata(
  user_question,
  metadata,
  variables,
  self.token,
  self.api_url,
  self.previous_interaction
)
  • Role: Handles intent = "count"
    • Builds a new prompt with the question + variable context
    • Extracts filters and optional genomic conditions (genes)
    • Sends the payload to PIC-SURE /aggregate endpoint
  • LLM Prompt: count_filter_extraction
  • Notes:
    • Filters must all come from the same dataset
    • correct_filter_values() ensures exact match with categorical values

πŸ’‘ Dev Tip: You may improve reliability by pre-filtering variables by dataset before calling correct_filter_values().
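
A sketch of that pre-filtering step (the conceptPath key is an assumption about the variable schema; concept paths such as \Synthea\demographics\Age appear in the examples below):

# Hypothetical: keep only variables whose concept path starts with the
# dataset resolved in Step 1, before correct_filter_values() runs.
def filter_variables_by_dataset(variables: list[dict], dataset: str) -> list[dict]:
    prefix = f"\\{dataset}\\"
    return [v for v in variables if v.get("conceptPath", "").startswith(prefix)]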

distribution.py

df, filter_desc = extract_distribution_dataframe(
  user_question,
  metadata,
  variables,
  self.token,
  self.api_url,
  self.previous_interaction
)

filename = plot_distribution(df, filter_description=filter_desc)
  • Role: Handles intent = "distribution"
    • Selects one continuous variable to plot
    • Extracts filters and genomic fields
    • Sends query β†’ gets DataFrame β†’ keeps only relevant column β†’ plots histogram
  • LLM Prompt: distribution_extraction
  • Plotting:
    • Uses matplotlib
    • Number of bins is fixed at 20 (βœ… future option to make dynamic; see the sketch below)
  • Returns:
    • A saved plot in /plots/
    • A title generated from filters
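
A minimal sketch consistent with the plot_distribution call shown above (the internals, file naming, and styling are assumptions):

import os
import uuid

import matplotlib
matplotlib.use("Agg")  # headless backend for server-side plotting
import matplotlib.pyplot as plt
import pandas as pd

# Sketch of plot_distribution: one continuous column in, a saved PNG path out.
def plot_distribution(df: pd.DataFrame, filter_description: str = "") -> str:
    column = df.columns[0]  # the selected continuous variable
    os.makedirs("plots", exist_ok=True)
    filename = os.path.join("plots", f"distribution_{uuid.uuid4().hex}.png")

    plt.figure()
    df[column].dropna().plot.hist(bins=20)  # bin count currently fixed at 20
    plt.xlabel(column)
    plt.ylabel("Participants")
    title = f"Distribution of {column}"
    if filter_description:
        title += f" ({filter_description})"
    plt.title(title)
    plt.savefig(filename)
    plt.close()
    return filename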

information.py

final_answer = information_from_metadata(
  user_question,
  variables,
  list(self.chat_history)
)
  • Role: Handles intent = "metadata"
    • Builds and sends prompt with variable list and user question
    • Returns raw LLM output directly (no post-processing)
  • LLM Prompt: information_response
  • Behavior:
    • Groups results by dataset
    • May mention units, types, concept paths
    • Uses full chat_history, not just previous interaction
      β†’ This is intentional, as "information" questions are often vague or broad. Including the full context improves relevance and consistency of the LLM’s response.

llm.py

  • Role: Contains all core logic for interacting with Amazon Bedrock.
  • Key functions:
    • call_bedrock_llm(prompt) – universal LLM call
    • robust_llm_json_parse() – corrects common LLM formatting issues
    • validate_llm_response() – ensures output schema matches expectation
    • correct_filter_values() – maps predicted filter values to valid metadata
  • Model: Default is mistral.mistral-large-2402-v1:0, but others are available
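
For reference, a stripped-down version of what robust_llm_json_parse() might do (the real implementation may handle more failure modes):

import json
import re

# Sketch: recover a JSON object from a raw LLM reply that may be wrapped
# in markdown fences or surrounded by extra prose.
def robust_llm_json_parse(raw: str) -> dict:
    text = re.sub(r"```(?:json)?", "", raw).strip()  # drop code fences
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"No JSON object found in LLM output: {raw!r}")
    return json.loads(text[start:end + 1])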

πŸ’‘ Dev Tip: You could expose model_id and temperature in pipeline configs for easier tuning.

context.py

  • Role: Auxiliary tool to retrieve the list of available datasets from the PIC-SURE API.
  • Function: get_available_datasets(token)
  • Use case:
    • To dynamically populate the {datasets} placeholder in metadata_extraction
    • Or for debugging / dataset discovery

Note: Not used dynamically in production yet.
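
One plausible implementation (hypothetical: the endpoint path and response shape depend on the PIC-SURE deployment) derives dataset names from the first segment of each concept path:

import requests

from confidential import PICSURE_API_URL, PICSURE_TOKEN

# Hypothetical sketch: dataset names inferred from concept paths such as
# \Synthea\demographics\Age. The /concepts endpoint and JSON shape are assumptions.
def get_available_datasets(token: str = PICSURE_TOKEN) -> list[str]:
    response = requests.get(
        f"{PICSURE_API_URL}/concepts",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    paths = [concept["conceptPath"] for concept in response.json()]
    return sorted({p.strip("\\").split("\\")[0] for p in paths})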


πŸ§ͺ Examples & Test Cases

This section outlines real user questions and how the chatbot processes them, based on documented test cases.

🧠 Intent: metadata

Example 1:

What are the demographic variables that are available?

The bot responds with:

**NHANES**
- AGE (Continuous, years) β€” \Nhanes\demographics\AGE
- RACE (Categorical) β€” \Nhanes\demographics\RACE
- Education level β€” \Nhanes\demographics\EDUCATION

**Synthea**
- Age (Continuous, years) β€” \Synthea\demographics\Age
- Race (Categorical) β€” \Synthea\demographics\Race
- Sex (Categorical) β€” \Synthea\demographics\Sex

Example 2:

Are there variables related to body mass index?

Search terms inferred: ['bmi', 'body mass index', 'obesity', 'fat', 'weight']
Suggested variables include:

  • Body Mass Index (kg/mΒ²)
  • Weight (kg)
  • Total Fat (g)
  • Trunk Fat (g)

πŸ”’ Intent: count

Example 1:

How many women over 60 in Synthea?

Produces:

{
  "filters": {
    "AGE": { "min": "60" },
    "SEX": ["Female"]
  },
  "genomic": null
}

β†’ PIC-SURE API returns the participant count.

Example 2:

How many participants with a variant in BRCA1 and over 50 years old?

Adds genomic filter:

"genomic": {
  "gene": ["BRCA1"]
}

πŸ“Š Intent: distribution

Example 1:

What is the BMI distribution of participants with extreme obesity and age over 21?

Returns:

{
  "distribution_variable": "Body Mass Index (kg per mΒ²)",
  "filters": {
    "Body Mass Index (kg per mΒ²)": { "min": "40" },
    "AGE": { "min": "21" }
  },
  "fields": ["Body Mass Index (kg per mΒ²)", "AGE"],
  "genomic": null
}

β†’ Histogram saved in /plots/

Example 2:

what is the distribution of the age for males above 21 years old with HTR4 variant in 1000genome?

Returns:

{
  "distribution_variable": "SIMULATED AGE",
  "filters": {
    "SIMULATED AGE": {"min": "21"},
    "SEX": ["male"]
  },
  "fields": [
    "SIMULATED AGE",
    "SEX"
  ],
  "genomic": {
    "gene": ["HTR4"]
  }
}

β†’ Histogram saved in /plots/


πŸ›  Developer Notes & Future Improvements

Below are areas for future work or known architectural considerations:

🧹 General

  • Improve modularity in pipeline.py
  • Unify logging and error handling
  • Add test suite with mocked Bedrock + PIC-SURE responses

πŸ§ͺ Prompts

  • Allow LLM to return nb_bins for plots (currently hardcoded to 20)
  • Detect and flag uncertain outputs ("uncertainty": true)
  • Add explicit support for variable type queries (e.g., β€œonly categorical”)

πŸ”§ Filtering & Correction Logic

  • Pre-filter variable list by dataset before running correct_filter_values().

πŸ“¦ Dataset Management

  • Use context.py to fetch datasets dynamically and inject into prompts
  • Cache variable metadata and avoid reloading in each step

πŸ™ Acknowledgments

This project was developed by Louis Hayot as part of a research internship at the
Avillach Lab, Department of Biomedical Informatics – Harvard Medical School (2025).

It integrates Amazon Bedrock for LLM calls and the PIC-SURE API for clinical/genomic data metadata access.
