No alignment for [UNK] tokens #12023
-
I have been using the alignment with great success, but there seems to be no alignment for [UNK] tokens:

import spacy
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG
# use an example transformer
DEFAULT_CONFIG["transformer"]["model"]["name"] = "bert-base-uncased"
nlp = spacy.blank("en")
trf = nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
trf.initialize(get_examples=lambda: [])
# forward pass
doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")
# inspect model output for '🚀'
doc._.trf_data.align[4]
# Ragged(data=array([], shape=(0, 0), dtype=int32), lengths=array([0], dtype=int32), data_shape=(-1, 1), starts_ends=None)
doc._.trf_data.wordpieces.strings[0][5] # +1 due to [CLS]
# '[UNK]'
doc._.trf_data.wordpieces
# WordpieceBatch(
# strings=[['[CLS]', 'my', 'name', 'is', 'wolfgang', '[UNK]', 'and',
# 'i', 'live', 'in', 'berlin', '.', '[SEP]']],
# input_ids=array([[ 101, 2026, 2171, 2003, 13865, 100, 1998, 1045, 2444,
# 1999, 4068, 1012, 102]], dtype=int32),
# attention_mask=array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]],
# dtype=float32),
# lengths=[13],
# token_type_ids=array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
# )

Naturally, one solution is simply to ignore the [UNK] tokens, but as seen here the [UNK] token embedding might contain useful information about the token 🚀. I am wondering if there is a principled way to match up the [UNK] tokens? In this case you can estimate it (tokens 3 and 5 match up with wordpiece embeddings 4 and 6, so token 4 must match up with embedding 5), but that breaks down even in simple cases such as repeating [UNK] tokens.
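For illustration only (this sketch is not from the original discussion), the neighbour-based estimate described above can be written against plain Python lists of wordpiece indices rather than the actual Ragged alignment data. The helper guess_unk_alignment is hypothetical; it fills in a missing alignment only when exactly one wordpiece sits in the gap between the neighbouring tokens' alignments, which is why adjacent unaligned tokens stay unresolved:

# Hypothetical sketch of the "estimate from neighbours" idea, not spaCy API.
def guess_unk_alignment(aligned):
    # aligned[i] is the list of wordpiece indices aligned to token i
    guessed = [list(a) for a in aligned]
    for i, a in enumerate(guessed):
        if a:
            continue
        prev_end = max(guessed[i - 1]) + 1 if i > 0 and guessed[i - 1] else None
        next_start = min(guessed[i + 1]) if i + 1 < len(guessed) and guessed[i + 1] else None
        # Fill the gap only when exactly one wordpiece fits between the neighbours.
        if prev_end is not None and next_start is not None and next_start - prev_end == 1:
            guessed[i] = [prev_end]
    return guessed

# A single unaligned token between aligned neighbours can be recovered ...
print(guess_unk_alignment([[1], [2], [3], [4], [], [6]]))
# [[1], [2], [3], [4], [5], [6]]
# ... but with two adjacent unaligned tokens the single-gap rule no longer
# applies and the gaps stay empty, which is the breakdown described above.
print(guess_unk_alignment([[1], [2], [3], [4], [], [], [7]]))
# [[1], [2], [3], [4], [], [], [7]]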
-
This is a weird case, thanks for bringing it to our attention. At first I had trouble understanding your issue since the number of tokens matched, but you're right that if you use token lengths the alignment is off. On the other hand, the information to get a proper alignment still seems to be present. Consider this code:

import spacy
from spacy_transformers import Transformer
from spacy_transformers.pipeline_component import DEFAULT_CONFIG
# use an example transformer
DEFAULT_CONFIG["transformer"]["model"]["name"] = "bert-base-uncased"
nlp = spacy.blank("en")
trf = nlp.add_pipe("transformer", config=DEFAULT_CONFIG["transformer"])
trf.initialize(get_examples=lambda: [])
# forward pass
doc = nlp("My name is Wolfgang 🚀🚀 and I live in Berlin.")
# inspect model output for '🚀'
trf_data = doc._.trf_data
strings = trf_data.wordpieces.strings[0]
offset = 1
for tok in doc:
    wplen = trf_data.align.lengths[tok.i]
    offset_based = strings[offset : offset + wplen]
    align_idx = 0
    if len(trf_data.align[tok.i].data):
        align_idx = trf_data.align[tok.i].data[0][0]
    align_based = strings[align_idx : align_idx + wplen]
    print(tok.i, tok, offset_based, align_based, sep="\t")
    offset += wplen

Output:
Here you see that using just offsets things get weird, but if you use the start values from trf_data.align the wordpieces line up correctly.
-
I think it's that spacy-alignments doesn't align the string [UNK] to the string 🚀. If you are feeling adventurous, you can try installing spacy-transformers from master (should currently be v1.2.0.dev0), which has been updated to use the alignments directly from fast tokenizers instead of the alignments from spacy-alignments. (spacy-alignments is still used for slow tokenizers.)
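For context (this sketch is not part of the original reply), the alignment a fast tokenizer exposes can be inspected directly with the Hugging Face tokenizers API: word_ids() reports which input word each wordpiece came from, so an [UNK] piece still maps back to the word that produced it:

from transformers import AutoTokenizer

# bert-base-uncased loads as a fast tokenizer by default
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("My name is Wolfgang 🚀 and I live in Berlin.")

# word_ids() gives the index of the originating word for every wordpiece
# (None for special tokens like [CLS] and [SEP]), so the '[UNK]' piece is
# still linked back to the word '🚀'.
for piece, word_id in zip(enc.tokens(), enc.word_ids()):
    print(piece, word_id, sep="\t")

Note that the word indices here are the tokenizer's own pre-tokenized words rather than spaCy tokens; the point is only that fast tokenizers keep this mapping even for [UNK] pieces.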
-
Thanks for the responses @adrianeboyd and @polm. Switching to the newest version of spacy-transformers does indeed resolve the problem.
This leads to the intended output:

import spacy
import spacy_wrap
nlp = spacy.blank("en")
# specify model from the hf hub
config = {"model": {"name": "dslim/bert-base-NER"}}
# add it to the pipe
nlp.add_pipe("token_classification_transformer", config=config)
# test it on two samples
doc = nlp("My name is Wolfgang 🚀 and I live in Berlin.")
# [('Wolfgang 🚀', 'PER'), ('Berlin', 'LOC')]
print([(ent.text, ent.label_) for ent in doc.ents])
doc = nlp("My name is Wolfgang 🚀🚀 🚀 and I live in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Wolfgang 🚀🚀 🚀', 'PER'), ('Berlin', 'LOC')]

Thanks for the help on this!