-
They should be identical. A couple of things could contribute to the difference; however, my money is on this. Could you try the following and see what happens?

```python
regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setToLowercase(False)
```
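To see why tokenization is a plausible culprit, here is a small illustration using only Python's `re` module (not Spark NLP itself; the two patterns are assumptions chosen for illustration): a plain whitespace split keeps `100-Bank` glued together as one token, while a word/punctuation pattern separates `100`, `-`, and `Bank`, which matches the token boundaries visible in the Hugging Face output below.

```python
import re

text = "100-Bank of America"

# A whitespace split keeps "100-Bank" as a single token,
# so an NER model downstream never sees "Bank" on its own.
ws_tokens = re.split(r"\s+", text)
print(ws_tokens)   # ['100-Bank', 'of', 'America']

# A word/punctuation pattern separates "100", "-", and "Bank",
# producing boundaries compatible with the Hugging Face result.
wp_tokens = re.findall(r"\w+|[^\w\s]", text)
print(wp_tokens)   # ['100', '-', 'Bank', 'of', 'America']
```

Which pattern the Spark NLP tokenizer effectively applies depends on its configuration, which is why swapping the tokenizer annotator is worth testing first.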
-
Hello,

I am trying to convert a pre-trained Hugging Face model into a Spark NLP model. I am following the notebook available in the JohnSnowLabs repo (https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20DistilBertForTokenClassification.ipynb#scrollTo=D1gv1yxX2lNL). I am using the same model; basically I am running their notebook with no modifications (and the same library versions).

The string I am using to test is `100-Bank of America`. My purpose is to check how the model behaves when there is `100-##`, where I want it to label `Bank of America` as `ORG`. The Hugging Face model gives me the expected results.
The code for the Python model is shown below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "100-Bank of America"
ner_results = nlp(example)
```
The `ner_results` are:

```python
[{'entity': 'B-ORG', 'score': 0.9992428, 'index': 3, 'word': 'Bank', 'start': 4, 'end': 8},
 {'entity': 'I-ORG', 'score': 0.99944407, 'index': 4, 'word': 'of', 'start': 9, 'end': 11},
 {'entity': 'I-ORG', 'score': 0.99941075, 'index': 5, 'word': 'America', 'start': 12, 'end': 19}]
```
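The character offsets in that output already pin down the expected span: `start=4` through `end=19` covers exactly `Bank of America`. A quick sanity check in plain Python, using the offsets copied from the output above:

```python
text = "100-Bank of America"

# Offsets taken verbatim from the ner_results above
entities = [
    {"entity": "B-ORG", "word": "Bank",    "start": 4,  "end": 8},
    {"entity": "I-ORG", "word": "of",      "start": 9,  "end": 11},
    {"entity": "I-ORG", "word": "America", "start": 12, "end": 19},
]

# Each entity's offsets line up with the original string
for e in entities:
    assert text[e["start"]:e["end"]] == e["word"]

# The whole ORG chunk spans from the first start to the last end
chunk = text[entities[0]["start"]:entities[-1]["end"]]
print(chunk)   # Bank of America
```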
As expected, it labels `Bank of America` as `ORG`.

However, the same model converted to Spark using Spark NLP (following exactly the notebook mentioned above) gives a different result: it is not able to separate `100-` from `Bank`, and the only chunk detected is `America`.

In my understanding it should give the same results, since the model and tokenizer are the same ones from Hugging Face. Has anyone faced this problem?
Thanks