-
They should be identical. A couple of things could contribute to the difference; however, my money is on this. Could you try the following and see what happens?

```python
regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setToLowercase(False)
```
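To see why tokenization is a plausible culprit, here is a small illustration using only Python's `re` module (not Spark NLP itself; the two patterns are assumptions chosen for illustration): a plain whitespace split keeps `100-Bank` glued together as one token, while a word/punctuation pattern separates `100`, `-`, and `Bank`, which matches the token boundaries visible in the Hugging Face output below.

```python
import re

text = "100-Bank of America"

# A whitespace split keeps "100-Bank" as a single token,
# so an NER model downstream never sees "Bank" on its own.
ws_tokens = re.split(r"\s+", text)
print(ws_tokens)   # ['100-Bank', 'of', 'America']

# A word/punctuation pattern separates "100", "-", and "Bank",
# producing boundaries compatible with the Hugging Face result.
wp_tokens = re.findall(r"\w+|[^\w\s]", text)
print(wp_tokens)   # ['100', '-', 'Bank', 'of', 'America']
```

Which pattern the Spark NLP tokenizer effectively applies depends on its configuration, which is why swapping the tokenizer annotator is worth testing first.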
-
Hello,

I am trying to convert a pre-trained Hugging Face model into a Spark NLP model. I am following the notebook available in the JohnSnowLabs repo (https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/transformers/HuggingFace%20in%20Spark%20NLP%20-%20DistilBertForTokenClassification.ipynb#scrollTo=D1gv1yxX2lNL). I am using the same model; basically I am running their notebook with no modifications (and the same library versions).

The string I am using to test is `100-Bank of America`. My purpose is to check how the model behaves when there is `100-##`, where I want it to label `Bank of America` as `ORG`. The Hugging Face model gives me the expected results.
The code for the Python model is shown below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

MODEL_NAME = "elastic/distilbert-base-cased-finetuned-conll03-english"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "100-Bank of America"
ner_results = nlp(example)
```
The `ner_results` are:

```python
[{'entity': 'B-ORG', 'score': 0.9992428, 'index': 3, 'word': 'Bank', 'start': 4, 'end': 8},
 {'entity': 'I-ORG', 'score': 0.99944407, 'index': 4, 'word': 'of', 'start': 9, 'end': 11},
 {'entity': 'I-ORG', 'score': 0.99941075, 'index': 5, 'word': 'America', 'start': 12, 'end': 19}]
```
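The character offsets in that output already pin down the expected span: `start=4` through `end=19` covers exactly `Bank of America`. A quick sanity check in plain Python, using the offsets copied from the output above:

```python
text = "100-Bank of America"

# Offsets taken verbatim from the ner_results above
entities = [
    {"entity": "B-ORG", "word": "Bank",    "start": 4,  "end": 8},
    {"entity": "I-ORG", "word": "of",      "start": 9,  "end": 11},
    {"entity": "I-ORG", "word": "America", "start": 12, "end": 19},
]

# Each entity's offsets line up with the original string
for e in entities:
    assert text[e["start"]:e["end"]] == e["word"]

# The whole ORG chunk spans from the first start to the last end
chunk = text[entities[0]["start"]:entities[-1]["end"]]
print(chunk)   # Bank of America
```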
As expected, it labels `Bank of America` as `ORG`.

However, the same model converted to Spark using Spark NLP (following exactly the notebook mentioned above) gives a different result: it is not able to separate `100-` from `Bank`, and the only chunk detected is `America`.

In my understanding it should give the same results, since the model and tokenizer are the same ones from Hugging Face. Has anyone faced this problem?
Thanks