
Add pipeline framework #61

Merged: 22 commits into main from feature/pipelines on Oct 4, 2024

Conversation

@jenniferjiangkells (Member) commented Sep 18, 2024

Description

Addresses #54

Introduces the concept of Pipeline, Component, and DataContainers.

These are the building blocks of the pipelining component of HC. We can give users three levels of control:

  1. Build your own pipeline, using inline functions - this is the easiest and most flexible level, good for quick experiments.
  2. Build your own pipeline, using Component classes - this adds an extra layer of abstraction, especially useful for wrapping specific models such as MedCAT, ClinicalBERT, or LLMs.
  3. Use prebuilt pipelines, e.g. MedicalCodingPipeline - prebuilt pipelines are pre-configured sets of components for specific use cases and have the highest level of abstraction. This is the easiest way to get up and running with something functional.

Extra - loading pipeline integrations from other libraries such as spacy, huggingface, etc. #27

I went a little over scope and additionally implemented TextPreprocessor, Model, TextPostProcessor, and MedicalCodingPipeline, which are implementations of Component and Pipeline. This helps to show what I want downstream usage to look like.

It's probably best to introduce new concepts through examples, so here's a code snippet:

from healthchain.io.containers import Document
from healthchain.pipeline import Pipeline
from healthchain.pipeline.components import Model
from healthchain.pipeline.components import TextPostProcessor
from healthchain.pipeline.components import TextPreprocessor
from healthchain.pipeline import MedicalCodingPipeline

################################################################################
# 1. Build your own pipeline, using inline functions
################################################################################

# initialise the pipeline with the data type you want to process
nlp_pipeline = Pipeline[Document]()

@nlp_pipeline.add(stage="preprocessing")
def tokenize(doc: Document) -> Document:
    doc.tokens = doc.text.split()
    return doc


@nlp_pipeline.add(stage="preprocessing", dependencies=["tokenize"])
def pos_tag(doc: Document) -> Document:
    # Dummy POS tagging
    doc.pos_tags = ["NOUN" if token[0].isupper() else "VERB" for token in doc.tokens]
    return doc


@nlp_pipeline.add(dependencies=["tokenize", "pos_tag"])
def ner(doc: Document) -> Document:
    # Dummy NER
    doc.entities = [
        token for token, pos in zip(doc.tokens, doc.pos_tags) if pos == "NOUN"
    ]
    return doc

print("Initial pipeline:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

@nlp_pipeline.add(position="after", reference="tokenize")
def remove_stopwords(doc: Document) -> Document:
    stopwords = {"the", "a", "an", "in", "on", "at"}
    doc.tokens = [token for token in doc.tokens if token not in stopwords]
    return doc

print("After adding remove_stopwords:")
print(nlp_pipeline)
print(nlp_pipeline.stages)


# Remove method (and re-add a replacement under the same name)
def new_tokenizer(doc: Document) -> Document:
    doc.tokens = doc.text.split() + ["<EOS>"]  # Add end-of-sentence token
    return doc


nlp_pipeline.remove("tokenize")
nlp_pipeline.add(new_tokenizer, name="tokenize", position="first")

# Replace method
def advanced_ner(doc: Document) -> Document:
    # More sophisticated NER logic
    doc.entities = [
        token for token in doc.tokens if token[0].isupper() and len(token) > 1
    ]
    return doc

nlp_pipeline.replace("ner", advanced_ner)

print("After replacing ner:")
print(nlp_pipeline)
print(nlp_pipeline.stages)

# Usage
# NLP pipeline
nlp = nlp_pipeline.build()

doc = Document("OpenAI released GPT-4 in 2023.")

result = nlp(doc)
print(f"Char count: {doc.char_count()}")
print(f"Word count: {doc.word_count()}")
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.get_entities()}")

preprocessing_components = nlp_pipeline._stages.get("preprocessing", [])
print(f"Preprocessing components: {[c.__name__ for c in preprocessing_components]}")


################################################################################
# 2. Build your own pipeline, using Component classes (or mix and match)
################################################################################
component_pipeline = Pipeline[Document]()

component_pipeline.add(TextPreprocessor())
component_pipeline.add(Model(model_path="path/to/model"))
component_pipeline.add(TextPostProcessor())
component_pipeline.add(remove_stopwords, position="last")

# Or this is how you would configure it, not sure about adding an extra config, seems a bit clunky, might remove
# postprocessor_config = TextPostProcessorConfig(
#     postcoordination_lookup={
#         "heart attack": "myocardial infarction",
#         "high blood pressure": "hypertension"
#     }
# )
# component_pipeline.add(TextPostProcessor(postprocessor_config))

components = component_pipeline.build()
result = components(doc)

print(component_pipeline)
print(component_pipeline.stages)
print(f"Tokens: {result.tokens}")
print(f"POS Tags: {result.pos_tags}")
print(f"NER: {result.entities}")


################################################################################
# 3. Use prebuilt pipelines e.g. MedicalCodingPipeline
################################################################################
pipeline = MedicalCodingPipeline.load("./path/to/model")

coding_pipeline = pipeline.build()
result = coding_pipeline(doc)

print(pipeline)
print(pipeline.stages)
print(f"Processed Text: {result.text}")
print(f"Tokens: {result.tokens}")
print(f"Entities: {result.entities}")

@jenniferjiangkells self-assigned this Sep 18, 2024
@jenniferjiangkells linked an issue Sep 18, 2024 that may be closed by this pull request
@jenniferjiangkells (Member Author)

@adamkells dw im going to explain everything ☝️

@adamkells (Contributor) left a comment


I think this is great. There's a fair bit of complexity but it's all hidden away in such a way that the user doesn't need to worry. This is really impressive and a great next step for the project. I've left a few minor comments but overall great job!

Review threads: healthchain/pipeline/__init__.py (1), healthchain/pipeline/basepipeline.py (6)
@jenniferjiangkells (Member Author)

@adamkells addressed all your comments, don't think you need to go over the code but feel free to check the documentation as I made quite a lot of changes there

@jenniferjiangkells (Member Author)

@adamkells my man can i just merge this

@adamkells (Contributor)

Oh go on then

@jenniferjiangkells merged commit 032f07e into main on Oct 4, 2024
5 checks passed
@jenniferjiangkells linked an issue Oct 4, 2024 that may be closed by this pull request
@jenniferjiangkells deleted the feature/pipelines branch on October 5, 2024, 10:05
Development

Successfully merging this pull request may close these issues:
  - Update documentation with pipeline usage
  - Design and implement pipeline framework
2 participants