Spark NLP 6.0.3 #14600

Merged: 15 commits, Jun 11, 2025
CHANGELOG (16 additions, 1 deletion)
@@ -1,5 +1,20 @@
 =======
-6.0.1
+6.0.3
 =======
+----------------
+New Features & Enhancements
+----------------
+* Introducing E5-V Universal Embeddings (SPARKNLP-1143)
+* Enhanced Chunking Strategies (SPARKNLP-1125)
+* New XML Reader (SPARKNLP-1119)
+
+----------------
+Bug Fixes
+----------------
+* Fixed typo for Excel reader notebook
+
+=======
+6.0.2
+=======
 ----------------
 New Features & Enhancements
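The changelog above only names the new features. As a rough illustration of the new XML reader, here is a minimal sketch assuming it follows the same `sparknlp.read()` pattern as the existing HTML and Excel readers; the `xml()` method name, its arguments, and the output layout are assumptions, not confirmed API:

```python
import sparknlp

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# Assumption: the 6.0.3 XML reader hangs off sparknlp.read(), mirroring the
# existing html()/email()/doc() readers; the path below is a placeholder.
xml_df = sparknlp.read().xml("path/to/documents.xml")

# Each row is expected to hold the parsed content of one XML file
xml_df.show(truncate=False)
```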
README.md (4 additions, 4 deletions)
@@ -63,7 +63,7 @@ $ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==6.0.2 pyspark==3.3.1
$ pip install spark-nlp==6.0.3 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
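The quick-start code that normally follows this line is collapsed in the diff view. A minimal sketch of it, assuming the standard `sparknlp.start()` entry point and using `explain_document_dl` purely as an illustrative pretrained pipeline name:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start Spark with Spark NLP on the classpath (pass gpu=True for the GPU build)
spark = sparknlp.start()

# Download and load an illustrative pretrained pipeline
pipeline = PretrainedPipeline("explain_document_dl", lang="en")

# Annotate a sample sentence and inspect the recognized entities
result = pipeline.annotate("Spark NLP ships hundreds of pretrained pipelines and models.")
print(result["entities"])
```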
@@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d

### Apache Spark Support

Spark NLP *6.0.2* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x
Spark NLP *6.0.3* has been built on top of Apache Spark 3.4 and fully supports Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -159,7 +159,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

### Databricks Support

Spark NLP 6.0.2 has been tested and is compatible with the following runtimes:
Spark NLP 6.0.3 has been tested and is compatible with the following runtimes:

| **CPU** | **GPU** |
|--------------------|--------------------|
@@ -176,7 +176,7 @@ We are compatible with older runtimes. For a full list check databricks support

### EMR Support

Spark NLP 6.0.2 has been tested and is compatible with the following EMR releases:
Spark NLP 6.0.3 has been tested and is compatible with the following EMR releases:

| **EMR Release** |
|--------------------|
build.sbt (1 addition, 1 deletion)
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "6.0.2"
version := "6.0.3"

(ThisBuild / scalaVersion) := scalaVer

conda/meta.yaml (2 additions, 2 deletions)
@@ -1,13 +1,13 @@
{% set name = "spark-nlp" %}
{% set version = "6.0.2" %}
{% set version = "6.0.3" %}

package:
name: {{ name|lower }}
version: {{ version }}

source:
url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/spark_nlp-{{ version }}.tar.gz
sha256: 8b97358206809a123076bcd58aba3b6487086c95c6370be6a9a34f0d5568b43d
sha256: ff09f27c512401cff1ec3af572069b2e2af35b87a0f6737c5340538bac10faf7

build:
noarch: python
docs/en/transformer_entries/E5VEmbeddings.md (new file, 133 additions)
@@ -0,0 +1,133 @@
{%- capture title -%}
E5VEmbeddings
{%- endcapture -%}

{%- capture description -%}
Universal multimodal embeddings using E5-V.

E5-V is a multimodal embedding model that bridges the modality gap between text and images, enabling strong performance in cross-modal retrieval, classification, clustering, and more. It supports both image+text and text-only embedding scenarios, and is fine-tuned from lmms-lab/llama3-llava-next-8b. The default model is `"e5v_int4"`.

Note that this annotator is only supported for Spark versions 3.4 and up.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val embeddings = E5VEmbeddings.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("e5v")
```

For available pretrained models please see the
[Models Hub](https://sparknlp.org/models?q=E5V).

For extended examples of usage, see
[E5VEmbeddingsTestSpec](https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/E5VEmbeddingsTestSpec.scala).

**Sources**:

- [E5-V: Universal Embeddings with Multimodal Large Language Models (arXiv)](https://arxiv.org/abs/2407.12580)
- [Hugging Face Model Card](https://huggingface.co/royokong/e5-v)
- [E5-V GitHub Repository](https://github.com/kongds/E5-V)
{%- endcapture -%}

{%- capture input_anno -%}
IMAGE
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
# Image + Text Embedding
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql.functions import lit

image_df = spark.read.format("image").option("dropInvalid", True).load(imageFolder)
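# Note: "imageFolder" above is a placeholder path to a directory of images
# readable by Spark's built-in image data source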
imagePrompt = "<|start_header_id|>user<|end_header_id|>\n\n<image>\\nSummary above image in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
test_df = image_df.withColumn("text", lit(imagePrompt))
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
e5vEmbeddings = E5VEmbeddings.pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("e5v")
pipeline = Pipeline().setStages([
    imageAssembler,
    e5vEmbeddings
])
result = pipeline.fit(test_df).transform(test_df)
result.select("e5v.embeddings").show(truncate=False)

# Text-Only Embedding
from sparknlp.util import EmbeddingsDataFrameUtils
textPrompt = "<|start_header_id|>user<|end_header_id|>\n\n<sent>\\nSummary above sentence in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
textDesc = "A cat sitting in a box."
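# The annotator still expects Spark's image schema for text-only input, so a
# single placeholder (empty) image row is used to carry the text prompt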
nullImageDF = spark.createDataFrame(
    spark.sparkContext.parallelize([EmbeddingsDataFrameUtils.emptyImageRow]),
    EmbeddingsDataFrameUtils.imageSchema)
textDF = nullImageDF.withColumn("text", lit(textPrompt.replace("<sent>", textDesc)))
e5vEmbeddings = E5VEmbeddings.pretrained() \
    .setInputCols(["image"]) \
    .setOutputCol("e5v")
result = e5vEmbeddings.transform(textDF)
result.select("e5v.embeddings").show(truncate=False)
{%- endcapture -%}

{%- capture scala_example -%}
// Image + Text Embedding
import org.apache.spark.sql.functions.lit
import com.johnsnowlabs.nlp.base.ImageAssembler
import com.johnsnowlabs.nlp.embeddings.E5VEmbeddings
import org.apache.spark.ml.Pipeline

val imageDF = spark.read.format("image").option("dropInvalid", value = true).load(imageFolder)
val imagePrompt = "<|start_header_id|>user<|end_header_id|>\n\n<image>\\nSummary above image in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
val testDF = imageDF.withColumn("text", lit(imagePrompt))
val imageAssembler = new ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
val e5vEmbeddings = E5VEmbeddings.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("e5v")
val pipeline = new Pipeline().setStages(Array(imageAssembler, e5vEmbeddings))
val result = pipeline.fit(testDF).transform(testDF)
result.select("e5v.embeddings").show(truncate = false)

// Text-Only Embedding
import com.johnsnowlabs.nlp.util.EmbeddingsDataFrameUtils.{emptyImageRow, imageSchema}
val textPrompt = "<|start_header_id|>user<|end_header_id|>\n\n<sent>\\nSummary above sentence in one word: <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n"
val textDesc = "A cat sitting in a box."
val nullImageDF = spark.createDataFrame(spark.sparkContext.parallelize(Seq(emptyImageRow)), imageSchema)
val textDF = nullImageDF.withColumn("text", lit(textPrompt.replace("<sent>", textDesc)))
val e5vEmbeddings = E5VEmbeddings.pretrained()
  .setInputCols("image")
  .setOutputCol("e5v")
val result2 = e5vEmbeddings.transform(textDF)
result2.select("e5v.embeddings").show(truncate = false)
{%- endcapture -%}

{%- capture api_link -%}
[E5VEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/E5VEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[E5VEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/cv/e5v_embeddings/index.html#sparknlp.annotator.cv.e5v_embeddings.E5VEmbeddings)
{%- endcapture -%}

{%- capture source_link -%}
[E5VEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/E5VEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}