Add pipeline framework (#61)

* Added pipelines WIP
* Added pipeline and io components
* Added validation and tests
* Tidied up typing, added property utils to pipeline, updated tests
* Fix component name string in stages property
* Changed model name to be generic
* Added methods to data containers
* Add simple preprocessing and postprocessing components
* Update dependencies
* Remove print statement
* Fix preprocessor name
* Remove configs from pre and postprocessors
* Fix Discord link
* Update documentation
* Make pipeline wrapper callable method less verbose
* Fail removing/replacing non-existing components louder
* Update README.md
* Added built-in .build() when pipeline is first called
* Update docs with usage
* README.md
* README.md - link
dotimplement · Oct 4, 2024 · commit 032f07e
1 parent cdfabb9
Showing 49 changed files with 4,682 additions and 1,127 deletions.
diff --git a/.gitignore b/.gitignore
@@ -160,5 +160,7 @@ cython_debug/
 #.idea/
 
 output/
+scrap/
 .DS_Store
 .vscode/
+.ruff_cache/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -26,7 +26,7 @@ If you're a developer, there are many ways you can contribute code:
 
 ## Join Our Discord
 
-Are you a domain expert with valuable insights? We encourage you to join our [Discord community](https://discord.gg/4v6XgGBZ) and share your wisdom. Your expertise can help shape the future of the project and guide us in making informed decisions.
+Are you a domain expert with valuable insights? We encourage you to join our [Discord community](https://discord.gg/UQC6uAepUz) and share your wisdom. Your expertise can help shape the future of the project and guide us in making informed decisions.
 
 We believe that every contribution, big or small, makes a difference. Thank you for being a part of our community!
 

diff --git a/README.md b/README.md
@@ -10,138 +10,183 @@
 
 </div>
 
-Simplify testing and evaluating AI and NLP applications in a healthcare context 💫 🏥.
+Simplify developing, testing and validating AI and NLP applications in a healthcare context 💫 🏥.
 
-Building applications that integrate in healthcare systems is complex, and so is designing reliable, reactive algorithms involving unstructured data. Let's try to change that.
+Building applications that integrate with electronic health record systems (EHRs) is complex, and so is designing reliable, reactive algorithms involving unstructured data. Let's try to change that.
 
 ```bash
 pip install healthchain
 ```
-First time here? Check out our [Docs](dotimplement.github.io/HealthChain/) page!
+First time here? Check out our [Docs](https://dotimplement.github.io/HealthChain/) page!
 
 ## Features
-- [x] 🍱 Create sandbox servers and clients that comply with real EHRs API and data standards.
-- [x] 🗃️ Generate synthetic FHIR resources or load your own data as free-text.
-- [x] 💾 Save generated request and response data for each sandbox run.
-- [x] 🎈 Streamlit dashboard to inspect generated data and responses.
-- [x] 🧪 Experiment with LLMs in an end-to-end HL7-compliant pipeline from day 1.
+- [x] 🛠️ Build custom pipelines or use [pre-built ones](https://dotimplement.github.io/HealthChain/reference/pipeline/pipeline/#prebuilt) for your healthcare NLP and ML tasks
+- [x] 🏗️ Add built-in CDA and FHIR parsers to connect your pipeline to interoperability standards
+- [x] 🧪 Test your pipelines in healthcare-context-aware [sandbox](https://dotimplement.github.io/HealthChain/reference/sandbox/sandbox/) environments
+- [x] 🗃️ Generate [synthetic healthcare data](https://dotimplement.github.io/HealthChain/reference/utilities/data_generator/) for testing and development
+- [x] 🚀 Deploy sandbox servers locally with [FastAPI](https://fastapi.tiangolo.com/)
 
 ## Why use HealthChain?
--  **Scaling EHR integrations is a manual and time-consuming process** - HealthChain abstracts away complexities so you can focus on AI development, not EHR configurations.
--  **Evaluating the behaviour of AI in complex systems is a difficult and labor-intensive task** - HealthChain provides a framework to test the real-world resilience of your whole system, not just your models.
--  **[Most healthcare data is unstructured](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372467/)** - HealthChain is optimised for real-time AI/NLP applications that deal with realistic healthcare data.
+-  **EHR integrations are manual and time-consuming** - HealthChain abstracts away complexities so you can focus on AI development, not EHR configurations.
+-  **It's difficult to track and evaluate multiple integration instances** - HealthChain provides a framework to test the real-world resilience of your whole system, not just your models.
+-  [**Most healthcare data is unstructured**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6372467/) - HealthChain is optimized for real-time AI and NLP applications that deal with realistic healthcare data.
 - **Built by health tech developers, for health tech developers** - HealthChain is tech stack agnostic, modular, and easily extensible.
 
-## Clinical Decision Support (CDS)
+## Pipeline
+Pipelines provide a flexible way to build and manage processing pipelines for NLP and ML tasks that can easily interface with parsers and connectors to integrate with EHRs.
+
+### Building a pipeline
+
+```python
+from healthchain.io.containers import Document
+from healthchain.pipeline import Pipeline
+from healthchain.pipeline.components import TextPreProcessor, Model, TextPostProcessor
+
+# Initialize the pipeline
+nlp_pipeline = Pipeline[Document]()
+
+# Add TextPreProcessor component
+preprocessor = TextPreProcessor(tokenizer="spacy")
+nlp_pipeline.add(preprocessor)
+
+# Add Model component (assuming we have a pre-trained model)
+model = Model(model_path="path/to/pretrained/model")
+nlp_pipeline.add(model)
+
+# Add TextPostProcessor component
+postprocessor = TextPostProcessor(
+    postcoordination_lookup={
+        "heart attack": "myocardial infarction",
+        "high blood pressure": "hypertension"
+    }
+)
+nlp_pipeline.add(postprocessor)
+
+# Build the pipeline
+nlp = nlp_pipeline.build()
+
+# Use the pipeline
+result = nlp(Document("Patient has a history of heart attack and high blood pressure."))
+
+print(f"Entities: {result.entities}")
+```
+### Using pre-built pipelines
+
+```python
+from healthchain.io.containers import Document
+from healthchain.pipeline import MedicalCodingPipeline
+
+# Load the pre-built MedicalCodingPipeline
+pipeline = MedicalCodingPipeline.load("./path/to/model")
+
+# Create a document to process
+result = pipeline(Document("Patient has a history of myocardial infarction and hypertension."))
+
+print(f"Entities: {result.entities}")
+```
+
+## Sandbox
+
+Sandboxes provide a staging environment for testing and validating your pipeline in a realistic healthcare context.
+
+### Clinical Decision Support (CDS)
 [CDS Hooks](https://cds-hooks.org/) is an [HL7](https://cds-hooks.hl7.org) published specification for clinical decision support.
 
 **When is this used?** CDS hooks are triggered at certain events during a clinician's workflow in an electronic health record (EHR), e.g. when a patient record is opened or when an order is selected.
 
-**What information is sent**: the context of the event and FHIR resources that are requested by your service, for example, the patient ID and information on the encounter and conditions they are being seen for.
+**What information is sent**: the context of the event and [FHIR](https://hl7.org/fhir/) resources that are requested by your service, for example, the patient ID and information on the encounter and conditions they are being seen for.
 
 **What information is returned**: “cards” displaying text, actionable suggestions, or links to launch a [SMART](https://smarthealthit.org/) app from within the workflow.
 
-**What you need to decide**: What data do I want my EHR client to send, and how will my service process this data.
-
 
 ```python
 import healthchain as hc
 
+from healthchain.pipeline import Pipeline
 from healthchain.use_cases import ClinicalDecisionSupport
 from healthchain.models import Card, CdsFhirData, CDSRequest
-from healthchain.data_generator import DataGenerator
-
+from healthchain.data_generator import CdsDataGenerator
 from typing import List
 
-# Decorate class with sandbox and pass in use case
 @hc.sandbox
-class myCDS(ClinicalDecisionSupport):
+class MyCDS(ClinicalDecisionSupport):
     def __init__(self) -> None:
-        self.data_generator = DataGenerator()
+        self.pipeline = Pipeline.load("./path/to/model")
+        self.data_generator = CdsDataGenerator()
 
     # Sets up an instance of a mock EHR client of the specified workflow
     @hc.ehr(workflow="patient-view")
     def ehr_database_client(self) -> CdsFhirData:
-        self.data_generator.generate()
-        return self.data_generator.data
+        return self.data_generator.generate()
 
     # Define your application logic here
     @hc.api
-    def my_service(self, request: CdsRequest) -> List[Card]:
-        result = "Hello " + request["patient_name"]
-        return result
-
-if __name__ == "__main__":
-    cds = myCDS()
-    cds.start_sandbox()
-```
-
-Then run:
-```bash
-healthchain run mycds.py
+    def my_service(self, data: CDSRequest) -> List[Card]:
+        result = self.pipeline(data)
+        return [
+            Card(
+                summary="Welcome to our Clinical Decision Support service.",
+                detail=result.summary,
+                indicator="info"
+            )
+        ]
 ```
-This will populate your EHR client with the data generation method you have defined, send requests to your server for processing, and save the data in `./output` by default.
 
-## Clinical Documentation
+### Clinical Documentation
 
-The ClinicalDocumentation use case implements a real-time Clinical Documentation Improvement (CDI) service. It helps convert free-text medical documentation into coded information that can be used for billing, quality reporting, and clinical decision support.
+The `ClinicalDocumentation` use case implements a real-time Clinical Documentation Improvement (CDI) service. It helps convert free-text medical documentation into coded information that can be used for billing, quality reporting, and clinical decision support.
 
 **When is this used?** Triggered when a clinician opts in to a CDI functionality (e.g. Epic NoteReader) and signs or pends a note after writing it.
 
-**What information is sent**: A [CDA (Clinical Document Architecture)](https://www.hl7.org/implement/standards/product_brief.cfm?product_id=7) document which contains continuity of care data and free-text data, e.g. a patient's problem list and the progress note that the clinician has entered in the EHR.
-
-**What information is returned**: A CDA document which contains additional structured data extracted and returned by your CDI service.
+**What information is sent**: A [CDA (Clinical Document Architecture)](https://www.hl7.org.uk/standards/hl7-standards/cda-clinical-document-architecture/) document which contains continuity of care data and free-text data, e.g. a patient's problem list and the progress note that the clinician has entered in the EHR.
 
 ```python
 import healthchain as hc
 
+from healthchain.pipeline import MedicalCodingPipeline
 from healthchain.use_cases import ClinicalDocumentation
 from healthchain.models import CcdData, ProblemConcept, Quantity
 
 @hc.sandbox
 class NotereaderSandbox(ClinicalDocumentation):
     def __init__(self):
-        self.cda_path = "./resources/uclh_cda.xml"
+        self.pipeline = MedicalCodingPipeline.load("./path/to/model")
 
     # Load an existing CDA file
     @hc.ehr(workflow="sign-note-inpatient")
     def load_data_in_client(self) -> CcdData:
-        with open(self.cda_path, "r") as file:
+        with open("/path/to/cda/data.xml", "r") as file:
             xml_string = file.read()
 
         return CcdData(cda_xml=xml_string)
 
-    # Define application logic
     @hc.api
     def my_service(self, ccd_data: CcdData) -> CcdData:
-        # Apply method from ccd_data.note and access existing entries from ccd.problems
-
-        new_problem = ProblemConcept(
-            code="38341003",
-            code_system="2.16.840.1.113883.6.96",
-            code_system_name="SNOMED CT",
-            display_name="Hypertension",
-            )
-        ccd_data.problems.append(new_problem)
-        return ccd_data
+        annotated_ccd = self.pipeline(ccd_data)
+        return annotated_ccd
 ```
+### Running a sandbox
 
+Make sure your `mycds.py` file instantiates the sandbox class and calls `run_sandbox()`:
 
-### Streamlit dashboard
-Note this is currently not meant to be a frontend to the EHR client, so you will have to run it separately from the sandbox application.
+```python
+cds = MyCDS()
+cds.run_sandbox()
+```
+This will populate your EHR client with the data generation method you have defined, send requests to your server for processing, and save the data in the `./output` directory.
 
+Then run:
 ```bash
-pip install streamlit
-streamlit streamlit-demo/app.py
+healthchain run mycds.py
 ```
-
+By default, the server runs at `http://127.0.0.1:8000`, and you can interact with the exposed endpoints at `/docs`.
 ## Road Map
-- [x] 📝 Adding Clinical Documentation use case
-- [ ] 🎛️ Version and test different EHR backend configurations
-- [ ] 🤖 Integrations with popular LLM and NLP libraries
-- [ ] ❓ Evaluation framework for pipelines and use cases
+- [ ] 🎛️ Versioning and artifact management for pipelines and sandbox EHR configurations
+- [ ] 🤖 Integrations with other pipeline libraries such as spaCy, HuggingFace, LangChain etc.
+- [ ] ❓ Testing and evaluation framework for pipelines and use cases
+- [ ] 🧠 Multi-modal pipelines that have built-in NLP to utilize unstructured data
 - [ ] ✨ Improvements to synthetic data generator methods
-- [ ] 👾 Frontend demo for EHR client
+- [ ] 👾 Frontend UI for EHR client and visualization features
 - [ ] 🚀 Production deployment options
 
 ## Contribute

diff --git a/docs/api/component.md b/docs/api/component.md
@@ -0,0 +1,6 @@
+# Component
+
+::: healthchain.pipeline.components.basecomponent
+::: healthchain.pipeline.components.preprocessors
+::: healthchain.pipeline.components.models
+::: healthchain.pipeline.components.postprocessors
diff --git a/docs/api/containers.md b/docs/api/containers.md
@@ -0,0 +1,3 @@
+# Containers
+
+::: healthchain.io.containers
diff --git a/docs/api/pipeline.md b/docs/api/pipeline.md
@@ -0,0 +1,3 @@
+# Pipeline
+
+::: healthchain.pipeline.basepipeline
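The commit message mentions two behaviours that the diff above only shows from the outside: removing or replacing a non-existing component should fail loudly, and `.build()` should run automatically the first time the pipeline is called. The following is a minimal, self-contained sketch of how such behaviour could look. It is not HealthChain's implementation: `SketchPipeline`, `lowercase`, and `normalise_terms` are hypothetical names, and plain strings stand in for the `Document` container.

```python
from typing import Callable, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class SketchPipeline(Generic[T]):
    """Hypothetical stand-in for a component pipeline (not the HealthChain class)."""

    def __init__(self) -> None:
        # (name, component) pairs kept in execution order
        self._components: List[Tuple[str, Callable[[T], T]]] = []
        self._built: Optional[Callable[[T], T]] = None

    def add(self, component: Callable[[T], T], name: Optional[str] = None) -> None:
        name = name or getattr(component, "__name__", type(component).__name__)
        self._components.append((name, component))
        self._built = None  # any change invalidates the previous build

    def _index_of(self, name: str) -> int:
        for index, (existing, _) in enumerate(self._components):
            if existing == name:
                return index
        # Fail loudly instead of silently ignoring an unknown component
        raise ValueError(f"Component '{name}' not found in pipeline")

    def remove(self, name: str) -> None:
        del self._components[self._index_of(name)]
        self._built = None

    def replace(self, name: str, component: Callable[[T], T]) -> None:
        self._components[self._index_of(name)] = (name, component)
        self._built = None

    def build(self) -> Callable[[T], T]:
        components = [component for _, component in self._components]

        def run(data: T) -> T:
            for component in components:
                data = component(data)
            return data

        self._built = run
        return run

    def __call__(self, data: T) -> T:
        # Build lazily on first call, so an explicit .build() is optional
        if self._built is None:
            self.build()
        return self._built(data)


def lowercase(text: str) -> str:
    return text.lower()


def normalise_terms(text: str) -> str:
    # Same lookup idea as the README's postcoordination_lookup example
    lookup = {"heart attack": "myocardial infarction", "high blood pressure": "hypertension"}
    for term, replacement in lookup.items():
        text = text.replace(term, replacement)
    return text


pipeline = SketchPipeline[str]()
pipeline.add(lowercase)
pipeline.add(normalise_terms)
result = pipeline("Patient has a history of Heart Attack and High Blood Pressure.")
```

Resetting the built callable on every `add`, `remove`, or `replace` is one simple way to keep the lazy build consistent with the current component list; the real library may handle this differently.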
diff --git a/docs/community/contribution_guide.md b/docs/community/contribution_guide.md
@@ -0,0 +1 @@
+# Contribution Guide
diff --git a/docs/community/resources.md b/docs/community/resources.md
@@ -0,0 +1 @@
+# Resources
diff --git a/docs/cookbook/cds_sandbox.md b/docs/cookbook/cds_sandbox.md
@@ -0,0 +1,48 @@
+# Build a CDS sandbox
+
+A CDS sandbox which uses `gpt-4o` to summarise patient information from synthetically generated FHIR resources received from the `patient-view` CDS hook.
+
+```python
+import healthchain as hc
+
+from healthchain.use_cases import ClinicalDecisionSupport
+from healthchain.data_generators import CdsDataGenerator
+from healthchain.models import Card, CdsFhirData, CDSRequest
+
+from langchain_openai import ChatOpenAI
+from langchain_core.prompts import PromptTemplate
+from langchain_core.output_parsers import StrOutputParser
+
+from typing import List
+
+@hc.sandbox
+class CdsSandbox(ClinicalDecisionSupport):
+  def __init__(self):
+    self.chain = self._init_llm_chain()
+    self.data_generator = CdsDataGenerator()
+
+  def _init_llm_chain(self):
+    prompt = PromptTemplate.from_template(
+      "Extract conditions from the FHIR resource below and summarize in one sentence using simple language \n'''{text}'''"
+      )
+    model = ChatOpenAI(model="gpt-4o")
+    parser = StrOutputParser()
+
+    chain = prompt | model | parser
+    return chain
+
+  @hc.ehr(workflow="patient-view")
+  def load_data_in_client(self) -> CdsFhirData:
+    data = self.data_generator.generate()
+    return data
+
+  @hc.api
+  def my_service(self, request: CDSRequest) -> List[Card]:
+    result = self.chain.invoke(str(request.prefetch))
+    return [Card(
+      summary="Patient summary",
+      indicator="info",
+      source={"label": "openai"},
+      detail=result,
+    )]
+```
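For readers unfamiliar with what the service in the cookbook example returns, here is a small sketch of the card structure using only the standard library. It assumes the CDS Hooks rules that `indicator` is one of `info`, `warning`, or `critical`, that `summary` stays under 140 characters, and that `source` carries a `label`. `SketchCard` and `build_response` are hypothetical names for illustration and are not part of `healthchain.models`.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional

ALLOWED_INDICATORS = {"info", "warning", "critical"}  # urgency levels from the CDS Hooks spec
MAX_SUMMARY_LENGTH = 140  # the spec asks for a summary shorter than this


@dataclass
class SketchCard:
    """Hypothetical, simplified card; the real Card model may differ."""

    summary: str
    indicator: str
    source: Dict[str, str]
    detail: Optional[str] = None

    def __post_init__(self) -> None:
        if self.indicator not in ALLOWED_INDICATORS:
            raise ValueError(f"indicator must be one of {sorted(ALLOWED_INDICATORS)}")
        if len(self.summary) >= MAX_SUMMARY_LENGTH:
            raise ValueError("summary must be shorter than 140 characters")
        if "label" not in self.source:
            raise ValueError("source must include a 'label'")


def build_response(cards: List[SketchCard]) -> Dict[str, List[Dict[str, Any]]]:
    # A CDS service responds with a JSON object holding a list of cards;
    # optional fields that were not set are left out of the payload.
    return {
        "cards": [
            {key: value for key, value in asdict(card).items() if value is not None}
            for card in cards
        ]
    }


response = build_response(
    [SketchCard(summary="Patient summary", indicator="info", source={"label": "openai"})]
)
```

Validating at construction time means a malformed card is rejected inside the service rather than by the EHR client that receives the response.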
diff --git a/docs/cookbook/index.md b/docs/cookbook/index.md
@@ -1 +1,6 @@
-# Cookbook
+# Examples
+
+The best way to learn is by example! Here are some to get you started:
+
+- [Build a CDS sandbox](./cds_sandbox.md): Build a clinical decision support (CDS) system that uses the *patient-view* hook to summarise patient information.
+- [Build a Clinical Documentation sandbox](./notereader_sandbox.md): Build a NoteReader system which extracts problem, medication, and allergy concepts from free-text clinical notes.