Research and design pipeline integration methods #27

Closed
jenniferjiangkells opened this issue Jun 6, 2024 · 6 comments · May be fixed by #48
Labels
Stage: Design 🎨 Issues that require design before implementation Stage: Research 🔬 Issues that require research before implementation

Comments

@jenniferjiangkells
Member

jenniferjiangkells commented Jun 6, 2024

Add pipeline integrations. Example usage:

# Import a pipeline from an external library
pipeline = BasePipeline()
pipeline.from_spacy('/path')

# How should these work?
pipeline.from_hf()
pipeline.from_langchain()

Integrations to start:

  • spacy
  • huggingface
  • langchain
  • sklearn

These libraries all have some form of pipeline. Some time may be needed to think about how to import / export whole pipelines from these libraries.

[Future]

  • haystack
  • llama-index
@jenniferjiangkells jenniferjiangkells converted this from a draft issue Jun 6, 2024
@jenniferjiangkells jenniferjiangkells added the Issue: Feature Request ✨ New feature or improvement to existing feature label Jun 6, 2024
@jenniferjiangkells jenniferjiangkells changed the title Add AI integrations Add AI library integrations Jun 7, 2024
@jenniferjiangkells jenniferjiangkells changed the title Add AI library integrations Add LLM/NLP library integrations Jun 9, 2024
@jenniferjiangkells jenniferjiangkells linked a pull request Jun 20, 2024 that will close this issue
@jenniferjiangkells jenniferjiangkells moved this from Ready to In progress in HealthChain Jun 20, 2024
@jenniferjiangkells jenniferjiangkells added the Component: Framework Issue/PR that addresses core framework functionality label Sep 16, 2024
@jenniferjiangkells
Member Author

Might need to update this to fit with the updated roadmap.

@jenniferjiangkells jenniferjiangkells changed the title Add LLM/NLP library integrations Add pipeline integrations Sep 20, 2024
@jenniferjiangkells jenniferjiangkells changed the title Add pipeline integrations Research and design pipeline integration methods Sep 20, 2024
@jenniferjiangkells jenniferjiangkells added Stage: Design 🎨 Issues that require design before implementation and removed Issue: Feature Request ✨ New feature or improvement to existing feature Component: Framework Issue/PR that addresses core framework functionality labels Sep 20, 2024
@jenniferjiangkells
Member Author

@adamkells I updated the description for this issue. Are you up for taking this?

@adamkells
Contributor

Yeah I can take this. I think the updated version of the issue is much cleaner than previously.

@jenniferjiangkells jenniferjiangkells added the Stage: Research 🔬 Issues that require research before implementation label Oct 3, 2024
@adamkells
Contributor

Summary of Findings

I looked at the four packages suggested (sklearn, spaCy, langchain, hugging-face).

These packages all introduce a concept of pipelines, but each is slightly different.

  • sklearn: A pipeline has fit + predict/transform methods so it can be applied flexibly to input data of an expected form and return model predictions.
  • spaCy: Converts a text input to a Document and then applies operations which add metadata to the Document.
  • hugging-face: Similar to spaCy but has more variety in pipelines for different tasks which means it has more flexible pipeline outputs (e.g. a list of dictionaries containing predictions).
  • langchain: Doesn't have a direct concept of a pipeline but instead has an LLMChain, which takes an LLM and a prompt and passes them through a chain of tasks to provide a flexible output (usually text).
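
To make the differences concrete, here is a rough sketch of the four call shapes. Everything in it is a stand-in written for illustration (`FakeDoc`, `spacy_style`, `hf_style`, `chain_style`, `SklearnStyle`); none of it imports or reproduces the real libraries, it only mimics the shape of their inputs and outputs as described above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a spaCy-style document: the text plus attached metadata."""
    text: str
    ents: list = field(default_factory=list)


def spacy_style(text):
    # spaCy-like shape: text in, enriched document object out.
    doc = FakeDoc(text)
    doc.ents = [word for word in text.split() if word.istitle()]
    return doc


def hf_style(text):
    # Hugging Face-like shape: text in, list of prediction dicts out.
    label = "POSITIVE" if "good" in text else "NEGATIVE"
    return [{"label": label, "score": 1.0}]


def chain_style(inputs):
    # LangChain-like shape: prompt variables in, generated text out.
    return f"Summary: {inputs['text'][:20]}"


class SklearnStyle:
    # sklearn-like shape: fit on training data, then predict on new data.
    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]
```

The first three take text (or prompt variables) and return something text-like in a single call, while the sklearn-style object needs a separate fit step and arbitrary feature input, which is the main reason it sits apart from the others.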

If I had to rank these in terms of how neatly they fit into our framework:

  1. spaCy: Almost identical in concept and naming convention to our existing pipelines.
  2. hugging-face: Very similar but as it can handle a wider range of tasks it will need to be restricted to a narrower list of tasks which fit our framework.
  3. langchain: Tasks and terminology are a bit different but can be wrapped in a way that makes it compatible with ours.
  4. sklearn: Very different, as it is by far the most flexible and is not typically applied to NLP tasks. I also think this is relatively easy for an experienced sklearn user to integrate on their own.

Proposal

Add methods to instantiate pipelines from spaCy, langchain and hugging-face, and omit sklearn for the moment.

The API can be as simple as from_spacy etc., with a check that the inputs and outputs of the pipeline conform to our constraints. We can throw errors like "from_hf expects a pipeline which accepts text as input and returns either string or list".
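
A minimal sketch of what that check could look like, assuming a hypothetical `TextPipeline` wrapper with a `from_callable` constructor. These names, the `source` and `probe` parameters, and the idea of validating by running a probe string are all assumptions for illustration, not HealthChain's actual API.

```python
class TextPipeline:
    """Hypothetical wrapper; not HealthChain's actual BasePipeline."""

    def __init__(self, func, source):
        self._func = func
        self.source = source

    @classmethod
    def from_callable(cls, func, source="hf", probe="test sentence"):
        # Reject anything that cannot be called on text at all.
        if not callable(func):
            raise TypeError(
                f"from_{source} expects a callable pipeline, "
                f"got {type(func).__name__}"
            )
        # Run once on a short probe string to check the output type.
        result = func(probe)
        if not isinstance(result, (str, list)):
            raise TypeError(
                f"from_{source} expects a pipeline which accepts text as input "
                f"and returns either string or list, got {type(result).__name__}"
            )
        return cls(func, source)

    def __call__(self, text):
        return self._func(text)


# Usage: a stand-in callable that returns a list of dicts passes the check.
pipe = TextPipeline.from_callable(lambda text: [{"label": "X", "score": 0.9}])
```

Probing with a real model costs one inference call at construction time, so a real implementation might prefer to inspect the pipeline's declared task instead, where the library exposes one.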

@jenniferjiangkells
Member Author

This is great 🌟 Let's focus on text pipelines and park the others for the future. Are you happy to also create an issue and work on the implementation, @adamkells?

Also what did you mean by:

I also think this is relatively easy for an experienced sklearn user to integrate on their own.

@adamkells
Contributor

Yeah happy to work on it.

With sklearn, I just meant that the components of an sklearn pipeline that a user may want to use make more sense added manually as part of the sandbox than used as an initialiser for the healthchain pipeline.
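
Roughly what adding a component manually could look like, as a sketch only. `MiniPipeline`, `KeywordClassifier` and `make_predict_step` are hypothetical stand-ins: the first is not the healthchain pipeline, and the second just mimics a fitted estimator with a `predict(X)` method rather than using sklearn itself.

```python
class MiniPipeline:
    """Hypothetical stand-in for a component-based pipeline."""

    def __init__(self):
        self.steps = []

    def add(self, func):
        # Register a step; steps run in the order they were added.
        self.steps.append(func)
        return func

    def __call__(self, data):
        for step in self.steps:
            data = step(data)
        return data


class KeywordClassifier:
    """Stand-in for any fitted estimator that exposes predict(X)."""

    def predict(self, X):
        return ["urgent" if "pain" in text.lower() else "routine" for text in X]


def make_predict_step(estimator):
    # Wrap the estimator so it reads the text and writes a label into the record.
    def step(record):
        record["label"] = estimator.predict([record["text"]])[0]
        return record
    return step


pipeline = MiniPipeline()
pipeline.add(lambda record: {**record, "text": record["text"].strip()})
pipeline.add(make_predict_step(KeywordClassifier()))
result = pipeline({"text": "  Chest pain since morning  "})
```

The user keeps control of fitting and feature handling, and the pipeline only sees a plain function, so no dedicated sklearn initialiser is needed.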

Projects
Status: Done

2 participants