
Johnsnowlabs 5.1.8 #765

Merged: 14 commits, Nov 17, 2023
4 changes: 4 additions & 0 deletions docs/_data/navigation.yml
@@ -330,6 +330,10 @@ jsl:
url: /docs/en/jsl/aws-emr-utils
- title: Utilities for AWS Glue
url: /docs/en/jsl/aws-glue-utils
- title: Utilities for Haystack
url: /docs/en/jsl/haystack-utils
- title: Utilities for Langchain
url: /docs/en/jsl/langchain-utils
- title: Release Testing Utilities
url: /docs/en/jsl/testing-utils
- title: Module Structure
70 changes: 38 additions & 32 deletions docs/en/jsl/databricks_utils.md

Large diffs are not rendered by default.

66 changes: 66 additions & 0 deletions docs/en/jsl/haystack_utils.md
@@ -0,0 +1,66 @@
---
layout: docs
seotitle: NLP | John Snow Labs
title: Utilities for Haystack
permalink: /docs/en/jsl/haystack-utils
key: docs-install
modify_date: "2020-05-26"
header: true
show_nav: true
sidebar:
nav: jsl
---

<div class="main-docs" markdown="1">


Johnsnowlabs provides the following nodes, which can be used inside the [Haystack Framework](https://haystack.deepset.ai/) for scalable pre-processing and embedding on
[Spark clusters](https://spark.apache.org/). With them you can build easily scalable, production-grade LLM and RAG applications.
See the [Haystack with Johnsnowlabs Tutorial Notebook](https://github.com/JohnSnowLabs/johnsnowlabs/blob/release/master/notebooks/haystack_with_johnsnowlabs.ipynb).

## JohnSnowLabsHaystackProcessor
Pre-process your documents in a scalable fashion in Haystack,
based on [Spark-NLP's DocumentCharacterTextSplitter](https://sparknlp.org/docs/en/annotators#documentcharactertextsplitter). It supports all of its [parameters](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_character_text_splitter/index.html#sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter).

```python
# Create a Pre-Processor that is connected to the Spark cluster
from johnsnowlabs.llm import embedding_retrieval
processor = embedding_retrieval.JohnSnowLabsHaystackProcessor(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process documents, distributed on the Spark cluster
processor.process(some_documents)
```
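
For reference, a minimal sketch of what `some_documents` can look like, assuming Haystack v1.x, where `Document` is importable from the top-level package:

```python
from haystack import Document

# Hypothetical input; any list of Haystack Documents works here
some_documents = [
    Document(content="Spark NLP splits long texts into smaller chunks."),
    Document(content="Each chunk can later be embedded and retrieved."),
]

# The processor returns one (smaller) Document per split chunk
split_docs = processor.process(some_documents)
```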

## JohnSnowLabsHaystackEmbedder
Scalable embedding computation with [any Sentence Embedding](https://nlp.johnsnowlabs.com/models?task=Embeddings) from John Snow Labs in Haystack.
You must provide the **NLU reference** of a sentence embeddings model to load it.
If you want to use a GPU with the embedding model, set `use_gpu=True`; on localhost this starts a Spark session with GPU jars.
For clusters, you must set up the cluster environment correctly; using [nlp.install_to_databricks()](https://nlp.johnsnowlabs.com/docs/en/jsl/install_advanced#into-a-freshly-created-databricks-cluster-automatically) is recommended.

```python
from johnsnowlabs.llm import embedding_retrieval
from haystack.document_stores import InMemoryDocumentStore

# Write some processed data to the Doc store so we can retrieve it later
document_store = InMemoryDocumentStore(embedding_dim=512)
document_store.write_documents(some_documents)

# Create Embedder which is connected to the Spark cluster
retriever = embedding_retrieval.JohnSnowLabsHaystackEmbedder(
    embedding_model='en.embed_sentence.bert_base_uncased',
    document_store=document_store,
    use_gpu=False,
)

# Compute Embeddings distributed in a cluster
document_store.update_embeddings(retriever)

```
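
Once embeddings are stored, retrieval goes through the usual Haystack retriever interface; a sketch, assuming `JohnSnowLabsHaystackEmbedder` exposes Haystack's standard `retrieve` method:

```python
# Retrieve the most similar documents for a query
# (assuming the standard Haystack EmbeddingRetriever interface)
results = retriever.retrieve(query="What does the document say?", top_k=3)
for doc in results:
    print(doc.content)
```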
</div>
6 changes: 5 additions & 1 deletion docs/en/jsl/install_advanced.md
@@ -177,8 +177,12 @@ Where to find your Databricks Access Token:
You can set the following parameters on the `nlp.install()` function to define properties of the cluster which will be created.
See [Databricks Cluster Creation](https://docs.databricks.com/dev-tools/api/latest/clusters.html#create) for a detailed description of all parameters.

You can use the `extra_pip_installs` parameter to install a list of additional PyPI libraries on the cluster.
Just set `nlp.install_to_databricks(extra_pip_installs=['langchain','farm-haystack==1.2.3'])` to install the libraries.

| Cluster creation Parameter | Default Value |
|----------------------------|--------------------------------------------|
| extra_pip_installs | `None` |
| block_till_cluster_ready | `True` |
| num_workers | `1` |
| cluster_name | `John-Snow-Labs-Databricks-Auto-Cluster🚀` |
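
Putting a few of these together, a hypothetical cluster-creation call might look like the following sketch (parameter names taken from the table above; host and token are placeholders):

```python
from johnsnowlabs import nlp

# Hypothetical example combining the table's parameters; values are placeholders
nlp.install_to_databricks(
    databricks_host='https://your_host.cloud.databricks.com',
    databricks_token='dbapi_token123',
    num_workers=1,
    cluster_name='my-jsl-cluster',
    extra_pip_installs=['langchain', 'farm-haystack==1.2.3'],
    block_till_cluster_ready=True,
)
```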
@@ -390,7 +394,7 @@ You can get it from:

``` python
# Create a new Cluster with Spark NLP and all licensed libraries ready to go:
nlp.install_to_databricks(databricks_host='https://your_host.cloud.databricks.com', databricks_token='dbapi_token123')
```
</div><div class="h3-box" markdown="1">

20 changes: 19 additions & 1 deletion docs/en/jsl/jsl_release_notes.md
@@ -13,7 +13,25 @@

<div class="main-docs" markdown="1">

See [Github Releases](https://github.com/JohnSnowLabs/johnsnowlabs/releases) for detailed information on Release History and Features.


## 5.1.8
Release date: 17-11-2023

The John Snow Labs 5.1.8 library was released with the following pre-installed and recommended dependencies:


| Library | Version |
|-----------------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/spark_ocr_versions/ocr_release_notes) | `5.0.2` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators) | `5.1.3` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.1.0` |
| [Spark-NLP-Display](https://sparknlp.org/docs/en/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.1.4` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |

## 5.1.7
Release date: 19-10-2023
94 changes: 94 additions & 0 deletions docs/en/jsl/langchain_utils.md
@@ -0,0 +1,94 @@
---
layout: docs
seotitle: NLP | John Snow Labs
title: Utilities for Langchain
permalink: /docs/en/jsl/langchain-utils
key: docs-install
modify_date: "2020-05-26"
header: true
show_nav: true
sidebar:
nav: jsl
---

<div class="main-docs" markdown="1">





Johnsnowlabs provides the following components, which can be used inside the [Langchain Framework](https://www.langchain.com/) as Agent Tools and Pipeline components for scalable pre-processing and embedding on
[Spark clusters](https://spark.apache.org/). With them you can build easily scalable, production-grade LLM and RAG applications.
See the [Langchain with Johnsnowlabs Tutorial Notebook](https://github.com/JohnSnowLabs/johnsnowlabs/blob/release/master/notebooks/langchain_with_johnsnowlabs.ipynb).

## JohnSnowLabsLangChainCharSplitter
Pre-process your documents in a scalable fashion in Langchain,
based on [Spark-NLP's DocumentCharacterTextSplitter](https://sparknlp.org/docs/en/annotators#documentcharactertextsplitter). It supports all of its [parameters](https://sparknlp.org/api/python/reference/autosummary/sparknlp/annotator/document_character_text_splitter/index.html#sparknlp.annotator.document_character_text_splitter.DocumentCharacterTextSplitter).

```python
from langchain.document_loaders import TextLoader
from johnsnowlabs.llm import embedding_retrieval

loader = TextLoader('/content/state_of_the_union.txt')
documents = loader.load()

# Create a Pre-Processor that is connected to the Spark cluster
jsl_splitter = embedding_retrieval.JohnSnowLabsLangChainCharSplitter(
    chunk_overlap=2,
    chunk_size=20,
    explode_splits=True,
    keep_seperators=True,
    patterns_are_regex=False,
    split_patterns=["\n\n", "\n", " ", ""],
    trim_whitespace=True,
)
# Process documents, distributed on the Spark cluster
pre_processed_docs = jsl_splitter.split_documents(documents)

```
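
The result is a list of LangChain `Document` objects, one per chunk; a quick sanity check using the standard LangChain `Document` fields:

```python
# Each chunk is a LangChain Document with page_content and metadata
print(len(pre_processed_docs))
print(pre_processed_docs[0].page_content)
print(pre_processed_docs[0].metadata)
```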

## JohnSnowLabsLangChainEmbedder
Scalable embedding computation with [any Sentence Embedding](https://nlp.johnsnowlabs.com/models?task=Embeddings) from John Snow Labs.
You must provide the **NLU reference** of a sentence embeddings model to load it.
You can start a Spark session by setting `hardware_target` to one of `cpu`, `gpu`, `apple_silicon`, or `aarch` in localhost environments.
For clusters, you must set up the cluster environment correctly; using [nlp.install_to_databricks()](https://nlp.johnsnowlabs.com/docs/en/jsl/install_advanced#into-a-freshly-created-databricks-cluster-automatically) is recommended.

```python
# Create Embedder which is connected to the Spark cluster
from johnsnowlabs.llm import embedding_retrieval
embeddings = embedding_retrieval.JohnSnowLabsLangChainEmbedder('en.embed_sentence.bert_base_uncased', hardware_target='cpu')

# Compute Embeddings distributed
from langchain.vectorstores import FAISS
retriever = FAISS.from_documents(pre_processed_docs, embeddings).as_retriever()

# Create A tool
from langchain.agents.agent_toolkits import create_retriever_tool
tool = create_retriever_tool(
    retriever,
    "search_state_of_union",
    "Searches and returns documents regarding the state-of-the-union.",
)


# Create an LLM Agent that uses the Tool
from langchain.agents.agent_toolkits import create_conversational_retrieval_agent
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(openai_api_key='YOUR_API_KEY')
agent_executor = create_conversational_retrieval_agent(llm, [tool], verbose=True)
result = agent_executor({"input": "what did the president say about going to east of Columbus?"})
result['output']

>>>
> Entering new AgentExecutor chain...
Invoking: `search_state_of_union` with `{'query': 'going to east of Columbus'}`
[Document(page_content='miles east of', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='in America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='out of America.', metadata={'source': '/content/state_of_the_union.txt'}), Document(page_content='upside down.', metadata={'source': '/content/state_of_the_union.txt'})]I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address.
> Finished chain.
I'm sorry, but I couldn't find any specific information about the president's statement regarding going to the east of Columbus in the State of the Union address.
```
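
For a lighter-weight check without the agent, the retriever can also be queried directly; a sketch using the standard LangChain retriever API:

```python
# Query the FAISS-backed retriever directly (standard LangChain API)
docs = retriever.get_relevant_documents("east of Columbus")
for doc in docs:
    print(doc.page_content)
```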


</div>
2 changes: 2 additions & 0 deletions johnsnowlabs/__init__.py
@@ -13,6 +13,8 @@
if try_import_lib("sparkocr") and try_import_lib("sparknlp"):
from johnsnowlabs import visual

from johnsnowlabs import llm


def new_version_online():
from .utils.pip_utils import get_latest_lib_version_on_pypi