
Add pipeline framework #61

Merged: 22 commits into main from feature/pipelines on Oct 4, 2024

Conversation

@jenniferjiangkells (Member) commented Sep 18, 2024

Description

Addresses #54

Introduces the concept of Pipeline, Component, and DataContainers.

These are the building blocks of the pipelining component of HC. We can give users three levels of control:

  1. Build your own pipeline, using inline functions - this is the easiest and most flexible level, good for quick experiments.
  2. Build your own pipeline, using Component classes - this adds an extra layer of abstraction, especially useful for wrapping specific models such as MedCAT, ClinicalBERT, or LLMs.
  3. Use prebuilt pipelines, e.g. MedicalCodingPipeline - prebuilt pipelines are pre-configured sets of components for specific use cases and have the highest level of abstraction. This is the easiest way to get up and running with something functional.

Extra - loading pipeline integrations from other libraries such as spacy, huggingface, etc. #27

I went a little over scope and additionally implemented TextPreprocessor, Model, TextPostProcessor, and MedicalCodingPipeline, which are implementations of Component and Pipeline. This helps to show what I want downstream usage to look like.

It's probably best to introduce new concepts through examples, so here's a code snippet:

from healthchain.io.containers import Document
from healthchain.pipeline import Pipeline
from healthchain.pipeline.components import Model
from healthchain.pipeline.components import TextPostProcessor
from healthchain.pipeline.components import TextPreprocessor
from healthchain.pipeline import MedicalCodingPipeline

################################################################################
# 1. Build your own pipeline, using inline functions
################################################################################

# initialise the pipeline with the data type you want to process
nlp_pipeline = Pipeline[Document]()

@nlp_pipeline.add(stage="preprocessing")
def tokenize(doc: Document) -> Document:
    doc.tokens = doc.text.split()
    return doc


@nlp_pipeline.add(stage="preprocessing", dependencies=["tokenize"])
def pos_tag(doc: Document) -> Document:
    # Dummy POS tagging
    doc.pos_tags = ["NOUN" if token[0].isupper() else "VERB" for token in doc.tokens]
    return doc


@nlp_pipeline.add(dependencies=["tokenize", "pos_tag"])
def ner(doc: Document) -> Document:
    # Dummy NER
    doc.entities = [
        token for token, pos in zip(doc.tokens, doc.pos_tags) if pos == "NOUN"
    ]
    return doc

print("Initial pipeline:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

@nlp_pipeline.add(position="after", reference="tokenize")
def remove_stopwords(doc: Document) -> Document:
    stopwords = {"the", "a", "an", "in", "on", "at"}
    doc.tokens = [token for token in doc.tokens if token not in stopwords]
    return doc

print("After adding remove_stopwords:")
print(nlp_pipeline)
print(nlp_pipeline.stages)


# Remove method (and re-add a replacement under the same name)
def new_tokenizer(doc: Document) -> Document:
    doc.tokens = doc.text.split() + ["<EOS>"]  # Add end-of-sentence token
    return doc


nlp_pipeline.remove("tokenize")
nlp_pipeline.add(new_tokenizer, name="tokenize", position="first")

# Replace method
def advanced_ner(doc: Document) -> Document:
    # More sophisticated NER logic
    doc.entities = [
        token for token in doc.tokens if token[0].isupper() and len(token) > 1
    ]
    return doc

nlp_pipeline.replace("ner", advanced_ner)

print("After replacing ner:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

# Usage
# NLP pipeline
nlp = nlp_pipeline.build()

doc = Document("OpenAI released GPT-4 in 2023.")

result = nlp(doc)
print(f"Char count: {doc.char_count()}")
print(f"Word count: {doc.word_count()}")
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.get_entities()}")

preprocessing_components = nlp_pipeline._stages.get("preprocessing", [])
print(f"Preprocessing components: {[c.__name__ for c in preprocessing_components]}")


################################################################################
# 2. Build your own pipeline, using Component classes (or mix and match)
################################################################################
component_pipeline = Pipeline[Document]()

component_pipeline.add(TextPreprocessor())
component_pipeline.add(Model(model_path="path/to/model"))
component_pipeline.add(TextPostProcessor())
component_pipeline.add(remove_stopwords, position="last")

# Or this is how you would configure it, not sure about adding an extra config, seems a bit clunky, might remove
# postprocessor_config = TextPostProcessorConfig(
#     postcoordination_lookup={
#         "heart attack": "myocardial infarction",
#         "high blood pressure": "hypertension"
#     }
# )
# component_pipeline.add(TextPostProcessor(postprocessor_config))

components = component_pipeline.build()
result = components(doc)

print(component_pipeline)
print(component_pipeline.stages)
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.entities}")


################################################################################
# 3. Use prebuilt pipelines e.g. MedicalCodingPipeline
################################################################################
pipeline = MedicalCodingPipeline.load("./path/to/model")

coding_pipeline = pipeline.build()
result = coding_pipeline(doc)

print(pipeline)
print(pipeline.stages)
print(f"Processed Text: {result.text}")
print(f"Tokens: {result.tokens}")
print(f"Entities: {result.entities}")

@jenniferjiangkells self-assigned this Sep 18, 2024
@jenniferjiangkells linked an issue Sep 18, 2024 that may be closed by this pull request
@jenniferjiangkells (Member Author)

@adamkells dw im going to explain everything ☝️

@adamkells (Contributor) left a comment


I think this is great. There's a fair bit of complexity but it's all hidden away in such a way that the user doesn't need to worry. This is really impressive and a great next step for the project. I've left a few minor comments but overall great job!

Review threads: healthchain/pipeline/__init__.py (1), healthchain/pipeline/basepipeline.py (6)
@jenniferjiangkells (Member Author)

@adamkells addressed all your comments, don't think you need to go over the code but feel free to check the documentation as I made quite a lot of changes there

@jenniferjiangkells (Member Author)

@adamkells my man can i just merge this

@adamkells (Contributor)

Oh go on then

@jenniferjiangkells merged commit 032f07e into main on Oct 4, 2024
5 checks passed
@jenniferjiangkells linked an issue Oct 4, 2024 that may be closed by this pull request
@jenniferjiangkells deleted the feature/pipelines branch on October 5, 2024, 10:05
Development

Successfully merging this pull request may close these issues:
  - Update documentation with pipeline usage
  - Design and implement pipeline framework
2 participants